How to import JSON from a file on Cloud Storage

Posted: 2019-02-23 21:42

Question:

I am trying to import a file (json.txt) from Cloud Storage into BigQuery via the API, and errors are thrown. When this is done via the web UI, it works with no errors (I even set maxBadRecords=0). Could someone please tell me what I'm doing incorrectly here? Is the code wrong, or do I need to change some setting in BigQuery somewhere?

The file is a plain-text UTF-8 file, and I've kept to the BigQuery docs on JSON imports. Its contents are as follows:

{"person_id":225,"person_name":"John","object_id":1}
{"person_id":226,"person_name":"John","object_id":1}
{"person_id":227,"person_name":"John","object_id":null}
{"person_id":229,"person_name":"John","object_id":1}

On import, the job throws "Value cannot be converted to expected type." for every single line:

   {
    "reason": "invalid",
    "location": "Line:15 / Field:1",
    "message": "Value cannot be converted to expected type."
   },
   {
    "reason": "invalid",
    "location": "Line:16 / Field:1",
    "message": "Value cannot be converted to expected type."
   },
   {
    "reason": "invalid",
    "location": "Line:17 / Field:1",
    "message": "Value cannot be converted to expected type."
   },
   {
    "reason": "invalid",
    "location": "Line:18 / Field:1",
    "message": "Value cannot be converted to expected type."
   },
   {
    "reason": "invalid",
    "message": "Too many errors encountered. Limit is: 10."
   }
  ]
 },
 "statistics": {
  "creationTime": "1384484132723",
  "startTime": "1384484142972",
  "endTime": "1384484182520",
  "load": {
   "inputFiles": "1",
   "inputFileBytes": "960",
   "outputRows": "0",
   "outputBytes": "0"
  }
 }
}

The file can be accessed here: http://www.sendspace.com/file/7q0o37

My code and schema are as follows:

def insert_and_import_table_in_dataset(tar_file, table, dataset=DATASET)
  # Build the load job configuration.
  config = {
    'configuration' => {
      'load' => {
        'sourceUris' => ["gs://test-bucket/#{tar_file}"],
        'schema' => {
          'fields' => [
            { 'name' => 'person_id',   'type' => 'INTEGER', 'mode' => 'nullable' },
            { 'name' => 'person_name', 'type' => 'STRING',  'mode' => 'nullable' },
            { 'name' => 'object_id',   'type' => 'INTEGER', 'mode' => 'nullable' }
          ]
        },
        'destinationTable' => {
          'projectId' => @project_id.to_s,
          'datasetId' => dataset,
          'tableId' => table
        },
        'sourceFormat' => 'NEWLINE_DELIMITED_JSON',
        'createDisposition' => 'CREATE_IF_NEEDED',
        'maxBadRecords' => 10,
      }
    }
  }

  # Insert the load job.
  result = @client.execute(
    :api_method => @bigquery.jobs.insert,
    :parameters => {
      # 'uploadType' => 'resumable',
      :projectId => @project_id.to_s,
      :datasetId => dataset
    },
    :body_object => config
  )

  # upload = result.resumable_upload
  # @client.execute(upload) if upload.resumable?

  puts result.response.body
  json = JSON.parse(result.response.body)

  # Poll the job until it reports DONE.
  while true
    job_status = get_job_status(json['jobReference']['jobId'])
    if job_status['status']['state'] == 'DONE'
      puts "DONE"
      return true
    else
      puts job_status['status']['state']
      puts job_status
      sleep 5
    end
  end
end

Could someone please tell me what I am doing wrong? What do I fix and where?

Also, at some point in the future I expect to be using compressed files and importing from them. Is ".tar.gz" OK for that, or do I need to make it a ".gz" only?

Thank you in advance for all help. Appreciate it.

Answer 1:

You're getting hit by the same thing that a lot of people (including me) have been hit by: you're importing a JSON file but not specifying an import format, so it defaults to CSV.

If you set configuration.load.sourceFormat to NEWLINE_DELIMITED_JSON, you should be good to go.
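
For reference, a minimal sketch of a load configuration with the format set explicitly might look like this (the bucket and object names below are placeholders; the schema and destination fields mirror the ones already in the question's code):

config = {
  'configuration' => {
    'load' => {
      # Without sourceFormat the load defaults to CSV, which is what produces the
      # "Value cannot be converted to expected type." errors above.
      'sourceFormat' => 'NEWLINE_DELIMITED_JSON',
      'sourceUris' => ['gs://your-bucket/json.txt'],   # placeholder bucket/object
      'destinationTable' => {
        'projectId' => @project_id.to_s,
        'datasetId' => dataset,
        'tableId' => table
      },
      'schema' => {
        'fields' => [
          { 'name' => 'person_id',   'type' => 'INTEGER' },
          { 'name' => 'person_name', 'type' => 'STRING' },
          { 'name' => 'object_id',   'type' => 'INTEGER' }
        ]
      }
    }
  }
}

Passed as the :body_object of the jobs.insert call, exactly as in the question's code, this should load the file without the conversion errors.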

We've got a bug filed to make this harder to do, or at least to detect when the file is the wrong type; I'll bump its priority.