I have the following config that works fine for loading a bunch of files into BigQuery:
config = {
  'configuration' => {
    'load' => {
      'sourceUris' => ['gs://my_bucket/my_files_*'],
      'schema' => {
        'fields' => fields_array
      },
      'schemaUpdateOptions' => ['ALLOW_FIELD_ADDITION'],
      'destinationTable' => {
        'projectId' => 'my_project',
        'datasetId' => 'my_dataset',
        'tableId' => 'my_table'
      },
      'sourceFormat' => 'NEWLINE_DELIMITED_JSON',
      'createDisposition' => 'CREATE_IF_NEEDED',
      'writeDisposition' => 'WRITE_TRUNCATE',
      'maxBadRecords' => 0
    }
  }
}
This is then executed with the following, where client and big_query are pre-initialised:
result = client.execute(
api_method: big_query.jobs.insert,
parameters: { projectId: 'my_project', datasetId: 'my_dataset' },
body_object: config
)
I am now trying to write the equivalent to create an external / federated data source instead of loading the data. I need to do this to effectively create staging tables for ETL purposes. I have successfully done this using the BigQuery UI, but I need to run it in code as it will eventually be a daily automated process. I'm having a bit of trouble with the API docs and can't find any good examples to refer to. Can anyone help? Thanks in advance!
By creating an external data source, do you mean creating a table that refers to an external data source? In that case you can use bigquery.tables.insert and fill out the externalDataConfiguration. The table can then be used in queries to read from the external data source.
If you only want to use the external data source in one query, you can attach a temporary external table to the query by putting the table definition in tableDefinitions. On the command line it looks like this:
bq query --external_table_definition=avroTable::AVRO=gs://path-to-avro 'SELECT * FROM avroTable'
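The same temporary-table approach is available through the API: a query job whose configuration.query.tableDefinitions maps each table name used in the SQL to an externalDataConfiguration. Here is a sketch in the style of your existing jobs.insert call (client and big_query pre-initialised as in the question; I've carried over your JSON source and schema rather than the Avro example above):

# Attach a temporary external table to a single query job via
# configuration.query.tableDefinitions (keys are the table names
# used in the SQL).
query_config = {
  'configuration' => {
    'query' => {
      'query' => 'SELECT * FROM staging_table',
      'tableDefinitions' => {
        'staging_table' => {
          'sourceUris' => ['gs://my_bucket/my_files_*'],
          'sourceFormat' => 'NEWLINE_DELIMITED_JSON',
          'schema' => { 'fields' => fields_array }
        }
      }
    }
  }
}

result = client.execute(
  api_method: big_query.jobs.insert,
  parameters: { projectId: 'my_project' },
  body_object: query_config
)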
Use idiomatic Cloud libraries when possible
Use the BigQuery module in the idiomatic Ruby client for GCP (google-cloud-bigquery), which is Generally Available, instead of google-api-ruby-client, which is both in "maintenance mode only" and "alpha". You can find this recommendation in google-api-ruby-client's own documentation.
Authentication:
You can define the project and credentials using environment variables, for example:
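A minimal sketch, using the gem's BIGQUERY_PROJECT and BIGQUERY_CREDENTIALS variables (set outside Ruby):

require "google/cloud/bigquery"

# Project and credentials come from the environment, e.g.:
#   export BIGQUERY_PROJECT=my_project
#   export BIGQUERY_CREDENTIALS=/path/to/keyfile.json
bigquery = Google::Cloud::Bigquery.new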
How to create an External Data Source object
This is an example of creating an External Data Source with bigquery.external. I have slightly modified it to add the relevant configuration from your solution. The object configuration methods are documented on the External::DataSource class; for example: autodetect, max_bad_records, urls, etc.
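A minimal sketch, assuming newline-delimited JSON files as in your load job (the autodetect choice is mine, not from the original answer; swap in an explicit schema if you prefer):

require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new

# Define an external (federated) data source over the GCS files.
# format: :json corresponds to NEWLINE_DELIMITED_JSON.
json_table = bigquery.external "gs://my_bucket/my_files_*", format: :json do |ext|
  ext.autodetect = true   # let BigQuery infer the schema
  ext.max_bad_records = 0 # fail the query on any unparsable record
end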
How to query it:
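A sketch of querying it, using the external: option on bigquery.query (my_ext_table is a placeholder name for the temporary table):

# Query the external source directly; the data is read from GCS at query time.
data = bigquery.query "SELECT * FROM my_ext_table",
                      external: { my_ext_table: json_table }

data.each do |row|
  puts row
end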
Note: both writeDisposition and createDisposition are only used for load/copy/query jobs that modify permanent BigQuery tables, so they wouldn't make much sense for an External Data Source. In fact, they appear neither in the REST API reference nor in the "Try this API" section for externalDataConfiguration.
For anyone attempting the same, here's what I used to get it working. There are not many working examples online and the docs take some deciphering, so I hope this helps someone else!
The documentation for externalDataConfiguration can be found in the BigQuery REST API reference and in the "Try this API" section for bigquery.tables.insert. Then, as pointed out in Hua Zhang's answer, you run bigquery.tables.insert instead of bigquery.jobs.insert.
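A minimal sketch of that call in the same style as the original load job (client and big_query pre-initialised as before; my_staging_table is a placeholder, and the field names follow the tables.insert REST reference):

# Create a permanent external (federated) table via tables.insert.
# The body is a Table resource with externalDataConfiguration in
# place of a load job configuration.
table_config = {
  'tableReference' => {
    'projectId' => 'my_project',
    'datasetId' => 'my_dataset',
    'tableId' => 'my_staging_table'
  },
  'externalDataConfiguration' => {
    'sourceUris' => ['gs://my_bucket/my_files_*'],
    'sourceFormat' => 'NEWLINE_DELIMITED_JSON',
    'maxBadRecords' => 0,
    'schema' => {
      'fields' => fields_array
    }
  }
}

result = client.execute(
  api_method: big_query.tables.insert,
  parameters: { projectId: 'my_project', datasetId: 'my_dataset' },
  body_object: table_config
)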