JSON Schema - BigQuery - What would my schema be?

Posted 2019-09-07 01:06

Question:

Google Cloud BigQuery - Load Data via JSON file

I am trying to load data into BigQuery (JSON newline delimited) from a JSON file. I'm stuck trying to figure out what my "schema" is / which one I should be using.

The JSON file, is a file of products.

What I have tried so far... NOTE: This is JUST for ONE product (of many); the same pattern then repeats for all the other products:

[{"sku": INTEGER,"name": "STRING", "type": "STRING", "price": FLOAT, "upc": "INTEGER", "category": [{"id": "STRING", "name": "STRING"}, {"id": "STRING", "name": "STRING"}, {"id": "STRING", "name": "STRING"}, {"id": "STRING", "name": "STRING"}], "shipping": FLOAT, "description": "STRING", "manufacturer": "STRING", "model":"STRING", "url": "STRING","image": "STRING"}]

NOTE: the "image" key, is a URL to the image

UNLESS THERE IS ANOTHER WAY... Is there a way to load the JSON file into BigQuery and have it "auto-generate" the table and dataset?

Answer 1:

If you are using the CLI tool, then this is the schema for your data:

[{"name": "sku", "type": "INT64", "mode": "NULLABLE"},
   {"name": "name", "type": "STRING", "mode": "NULLABLE"},
   {"name": "type", "type": "STRING", "mode": "NULLABLE"},
   {"name": "price", "type": "FLOAT", "mode": "NULLABLE"},
   {"name": "upc", "type": "STRING", "mode": "NULLABLE"},
   {"fields":
     [{"name": "id", "type": "STRING", "mode": "NULLABLE"}, {"name": "name", "type": "STRING", "mode": "NULLABLE"}],
    "name": "category", "type": "RECORD", "mode": "REPEATED"},
   {"name": "shipping", "type": "FLOAT", "mode": "NULLABLE"},
   {"name": "description", "type": "STRING", "mode": "NULLABLE"},
   {"name": "manufacturer", "type": "STRING", "mode": "NULLABLE"},
   {"name": "model", "type": "STRING", "mode": "NULLABLE"},
   {"name": "url", "type": "STRING", "mode": "NULLABLE"},
   {"name": "image", "type": "STRING", "mode": "NULLABLE"}]

You can save the schema in a file (such as "schema.json") and then run the command:

bq load --source_format=NEWLINE_DELIMITED_JSON dataset_id.test_table path/to/json_data path/to/schema.json

Here, path/to/json_data is the path to your data. It can be either a path on your local machine (such as /documents/so/jsondata.json) or a path in Google Cloud Storage (such as gs://analyzes/json_data.json, for instance).

The schema can either live on your local machine or be specified inline on the command line, but for this operation it has to be specified one way or the other.
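
If you'd rather use the Python client library than the bq tool, a roughly equivalent load looks like the sketch below. This is just a sketch: the GCS URI and the project/dataset/table IDs are placeholders you'd substitute with your own.

# A rough Python equivalent of the bq load command above, using the
# google-cloud-bigquery client library. The GCS URI and table ID are
# placeholders -- substitute your own.
from google.cloud import bigquery

client = bigquery.Client()

# Same schema as schema.json, expressed with SchemaField objects
# (mode defaults to NULLABLE).
schema = [
    bigquery.SchemaField("sku", "INT64"),
    bigquery.SchemaField("name", "STRING"),
    bigquery.SchemaField("type", "STRING"),
    bigquery.SchemaField("price", "FLOAT"),
    bigquery.SchemaField("upc", "STRING"),
    bigquery.SchemaField(
        "category", "RECORD", mode="REPEATED",
        fields=[
            bigquery.SchemaField("id", "STRING"),
            bigquery.SchemaField("name", "STRING"),
        ],
    ),
    bigquery.SchemaField("shipping", "FLOAT"),
    bigquery.SchemaField("description", "STRING"),
    bigquery.SchemaField("manufacturer", "STRING"),
    bigquery.SchemaField("model", "STRING"),
    bigquery.SchemaField("url", "STRING"),
    bigquery.SchemaField("image", "STRING"),
]

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    schema=schema,
)

load_job = client.load_table_from_uri(
    "gs://your-bucket/jsondata.json",      # placeholder URI
    "your-project.dataset_id.test_table",  # placeholder table ID
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish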

Now, you mentioned in the comments on my first answer a type of operation where BigQuery does not require a schema.

You can indeed do that, but only for federated tables, that is, tables created using an external file as their reference (these files usually live in GCS or Google Drive).

To do so, you would first need to have your JSON data in GCS, and then create the table in BQ. Using the CLI, this command creates the federated table with the JSON data from GCS as its source:

bq mk --external_table_definition=@NEWLINE_DELIMITED_JSON=gs://bucket_name/jsondata.json dataset_id.table_test 

This command line does not specify the schema, and BQ does its best to infer what it should be from the data (I tested this with your data and it worked just fine, but I could only use legacy SQL afterwards).

Keep in mind, though, that this process is not guaranteed to work every time, and you should use federated tables only if they meet the requirements of your project; otherwise, it's easier and faster to load the data into BQ and run your queries from there. In the second reference I suggested, you can read more about when it's best to use federated tables.
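
For completeness, here is a sketch of the same schema-less federated setup through the Python client library (again, the bucket and table IDs are placeholders):

# A sketch of the same federated-table setup via the Python client
# library. Bucket and table IDs are placeholders. Schema autodetection
# is requested instead of an explicit schema, mirroring the bq mk call.
from google.cloud import bigquery

client = bigquery.Client()

external_config = bigquery.ExternalConfig("NEWLINE_DELIMITED_JSON")
external_config.source_uris = ["gs://bucket_name/jsondata.json"]  # placeholder
external_config.autodetect = True  # let BQ infer the schema

table = bigquery.Table("your-project.dataset_id.table_test")  # placeholder
table.external_data_configuration = external_config
client.create_table(table)  # queries now read straight from GCS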



Answer 2:

SOLUTION:

First, Will's solution is correct, and it should work 99% of the time.

NOTE: the dataset (JSON file) has around 5,000+ products, roughly 15,000 lines (depending on how you have it formatted).

However, with this particular dataset in conjunction with BigQuery (the way it wants the data), it would for some reason (I'm not sure of the exact reason) not work as expected. I did narrow it down to what (I think) was causing the error: I believe it was the "category": [{"id": 1234, "name": "ipod"}, etc.] section of each product. BigQuery seems to be pretty "fussy" with nested JSON data; you need to get it just right, and via the command line ONLY (no Web UI).

As soon as I deleted the "category" section (along with the ids/names) from the dataset and schema, I was able to load the data just fine.

This was, of course, only a very small sample of the dataset, as I wasn't going to sift through 5,000 products deleting each "category" section.
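
As an aside, one thing worth ruling out before giving up on JSON entirely: the sample in the question is wrapped in [ ... ], i.e. a single JSON array, and BigQuery's newline-delimited loader rejects that outright; it wants one object per line. A small sketch (file names are placeholders) that rewrites such a file:

# Rewrites a file containing one big JSON array into newline-delimited
# JSON (one object per line), which is what BigQuery expects.
# File names are placeholders.
import json

with open("data.json") as src:
    products = json.load(src)  # the whole array, loaded into memory

with open("data.ndjson", "w") as dst:
    for product in products:
        dst.write(json.dumps(product) + "\n")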

SOLUTION - use a CSV file (it helps to be somewhat familiar with MS Excel):

STEPS I DID:

NOTE: Don't do a right-click "Open With" (in Windows) on the file.

NOTE: I imported the whole (5,000 products) data.json file as delimited.

  • I opened Excel (v2016 in my case) FIRST (not the file) and created a BLANK spreadsheet.
  • Click on the "Data Tab" on the "Ribbon" at the top. Select the option "From Text".
  • Change the view/file type to "All Files (*.*)" so you can see your "JSON" file.
  • Make sure "Delimited" is selected, and 437:OEM United States (unless you need it to be something else), click Next.
  • UN-select "Tab", and select "Comma", click Next.
  • Then you're going to want to select each column (inside the wizard) and change the "Column data format" to "Text", etc. When you're done, click Finish.

It didn't come in as perfect as I wanted (usually never does).

GETTING RID OF UNNECESSARY CHARACTERS ([ { : " ") AND TEXT:

  • Did a "Find and Replace" (Ctrl+F for Windows), Replace All for each of the characters/text I didn't want or need. To do a "Delete", just leave the "Replace" BLANK (so it replaces the text/character with nothing).

  • Then I filtered and sorted my data. This let me find the columns I didn't want, which were "category", "id", and the "name" that corresponds to the category id (NOT the name of the product).

  • Once you get your data how you want it, do a "Save As". Change "Save as type:" to "CSV UTF-8 (Comma delimited) (*.csv)", and name your file anything, e.g. myfile.csv.

  • Then you can just use the Google BigQuery Web UI (or the command line if you want). For "Location / File Upload", you'll need to store the file in either Cloud Storage or Google Drive if it's too big; otherwise, just upload it from your local computer.

ALSO!!!! DON'T FORGET!!!

  • Under Options, select "Comma", and put a "1" in "Header rows to skip" (this is the row with the name at the top of each of your columns; you don't want to import the column names as data. They are just there to help keep things straight and sorted).
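
If you'd rather script this whole Excel workflow, a sketch along the following lines should produce an equivalent CSV (field names are taken from the sample data in the question; file names are placeholders):

# Scripts the manual Excel workflow above: reads the JSON array of
# products, drops the "category" column, and writes a UTF-8 CSV with a
# single header row. File names are placeholders.
import csv
import json

COLUMNS = ["sku", "name", "type", "price", "upc", "shipping",
           "description", "manufacturer", "model", "url", "image"]

with open("data.json") as src:
    products = json.load(src)

with open("products.csv", "w", newline="", encoding="utf-8") as dst:
    # extrasaction="ignore" silently skips the nested "category" key.
    writer = csv.DictWriter(dst, fieldnames=COLUMNS, extrasaction="ignore")
    writer.writeheader()  # this is the one header row you skip on import
    writer.writerows(products)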