How to insert/append unstructured data to bigquery

2020-02-15 04:25发布

Background

I want to insert/append newline formatted JSON into bigquery table through python client API.

Eg:

{"name":"xyz",mobile:xxx,location:"abc"}
{"name":"xyz",mobile:xxx,age:22}

Issue is, all fields in a row are optional and there is no fixed defined schema for the data.

Query

I have read that we can use Federated tables, which supports autoschema detection.

However, I am looking for a feature, that would automatically detect schema from data,create tables accordingly and even adjust the table schema if any extra columns/keys appear in data instead of creating new table.

Would this be possible using python client API.

1条回答
家丑人穷心不美
2楼-- · 2020-02-15 04:41

You can use autodetect with BigQuery load API, i.e. your example using bq cli tool will look like following:

~$ cat /tmp/x.json
{"name":"xyz","mobile":"xxx","location":"abc"}
{"name":"xyz","mobile":"xxx","age":"22"}

~$ bq load --autodetect --source_format=NEWLINE_DELIMITED_JSON tmp.x /tmp/x.json
Upload complete.

~$ bq show tmp.x
Table tmp.x

   Last modified          Schema          Total Rows   Total Bytes   Expiration  
 ----------------- --------------------- ------------ ------------- ------------ 
  16 Aug 08:23:35   |- age: integer       2            33                        
                    |- location: string                                          
                    |- mobile: string                                            
                    |- name: string                                              


~$ bq query "select * from tmp.x"

+------+----------+--------+------+
| age  | location | mobile | name |
+------+----------+--------+------+
| NULL | abc      | xxx    | xyz  |
|   22 | NULL     | xxx    | xyz  |
+------+----------+--------+------+

Update: If later you need to add additional fields, you can use schema_update_option to allow new fields. Alas it doesn't yet work with autodetect, so you need to provide new schema explicitly to the load API:

~$ cat /tmp/x1.json 
{"name":"abc","mobile":"yyy","age":"25","gender":"male"}

~$ bq load --schema=name:STRING,age:INTEGER,location:STRING,mobile:STRING,gender:STRING --schema_update_option=ALLOW_FIELD_ADDITION --source_format=NEWLINE_DELIMITED_JSON tmp.x /tmp/x1.json
Upload complete.

~$ bq show tmp.x
Table tmp.x

   Last modified          Schema          Total Rows   Total Bytes   Expiration  
 ----------------- --------------------- ------------ ------------- -----------
  19 Aug 10:43:09   |- name: string       3            57                        
                    |- age: integer                                              
                    |- location: string                                          
                    |- mobile: string                                            
                    |- gender: string                                            


~$ bq query "select * from tmp.x"
status: DONE   
+------+------+----------+--------+--------+
| name | age  | location | mobile | gender |
+------+------+----------+--------+--------+
| abc  |   25 | NULL     | yyy    | male   |
| xyz  | NULL | abc      | xxx    | NULL   |
| xyz  |   22 | NULL     | xxx    | NULL   |
+------+------+----------+--------+--------+
查看更多
登录 后发表回答