Get TableSchema from BigQuery result PCollection<T

When I run a query in BigQuery Web UI, the results are displayed in a table where both name and type of each field are known (even when a field is a result of COUNT(), AVG(), ... operation, type of field is known, of course). The results can be then directly exported as a table/json/csv.

My question is, when I retrieve query results in my java project, e.g. with a query:

String query =  "SELECT nationality, COUNT(DISTINCT personID) AS population 
                 FROM Dataset.Table 
                 GROUP BY nationality";

PCollection<TableRow> result = p.apply(BigQueryIO.Read.fromQuery(query));

... is it possible to obtain the schema of TableRow in result PCollection, without explicitly defining it? I think it must be possible, since it's possible with the same query when using BigQuery Web UI. But I can't figure out how to do it ...

TableSchema schema =  // function of PCollection<TableRow> result ?

result.apply(BigQueryIO.Write
                .named("Write Results Table")
                .to(getTableReference(tableName))
                .withSchema(schema));

That way query results could be always automatically exported/saved into a new table (only the table name then needs to be explicitly provided).

Any ideas? Any help would be appreciated :)

标签： java google-bigquery google-cloud-dataflow

2条回答

smile是对你的礼貌

2楼-- · 2019-01-28 10:36

Conceptually - you should write the function which will iterate thru all cells of given TableRow and for each - get name and type and while iterating you will create respective TableSchema.
For simple schemas, I would expect, it should be relatively easy.
For schemas with records, repeated, etc. this could be more complex

0人赞添加讨论(0) 举报

做自己的国王

3楼-- · 2019-01-28 10:48

Unfortunately, Dataflow SDK doesn't expose a schema returned by BigQuery via Dataflow's BigQueryIO API. There's no "good" workaround within the Dataflow API alone.

Defining a schema manually is one workaround.

Alternatively, you could make a separate query to BigQuery directly via jobs: query at pipeline construction time, whose result can then be passed to BigQueryIO.Write transform. This may incur additional cost, but that can probably be mitigated by altering the query slightly to reduce the amount of data processed. Correctness of the output is not relevant, since you'd be storing the schema only.

0人赞添加讨论(0) 举报

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间