I wanted to take advantage of the new BigQuery functionality of time-partitioned tables, but am unsure whether this is currently possible in version 1.6 of the Dataflow SDK.
Looking at the BigQuery JSON API, to create a day partitioned table one needs to pass in a
"timePartitioning": { "type": "DAY" }
option, but the com.google.cloud.dataflow.sdk.io.BigQueryIO interface only allows specifying a TableReference.
I thought that maybe I could pre-create the table and sneak in a partition decorator via a BigQueryIO.Write.toTableReference lambda? Has anyone else had success creating/writing partitioned tables via Dataflow?
This seems like a similar issue to setting the table expiration time which isn't currently available either.
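For reference, a tables.insert request body that creates a day-partitioned table looks roughly like this (the project, dataset, table, and schema below are placeholders, not from the original question):

```json
{
  "tableReference": {
    "projectId": "my-project",
    "datasetId": "my_dataset",
    "tableId": "my_table"
  },
  "timePartitioning": { "type": "DAY" },
  "schema": { "fields": [ { "name": "ts", "type": "TIMESTAMP" } ] }
}
```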
I believe it should be possible to use the partition decorator when you are not using streaming. We are actively working on supporting partition decorators through streaming. Please let us know if you are seeing any errors today with non-streaming mode.
If you pass the table name in
table_name_YYYYMMDD
format, then BigQuery will treat it as a sharded table, which can simulate partitioned-table features. Refer to the documentation: https://cloud.google.com/bigquery/docs/partitioned-tables

I have written data into BigQuery partitioned tables through Dataflow. These writes are dynamic: if data already exists in a partition, I can either append to it or overwrite it.
I have written the code in Python. It is a batch-mode write operation into BigQuery.
It works fine.
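To make the two naming schemes above concrete, here is a minimal plain-Java sketch (table and date values are hypothetical) of how a date-sharded table name (`table_YYYYMMDD`) and a partition decorator (`table$YYYYMMDD`) are formed:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class TableNames {
    private static final DateTimeFormatter FMT = DateTimeFormatter.ofPattern("yyyyMMdd");

    // Sharded-table name: BigQuery treats table_YYYYMMDD as one shard of a date-sharded table.
    static String shardName(String table, LocalDate date) {
        return table + "_" + date.format(FMT);
    }

    // Partition decorator: writes land in the named day partition of a single partitioned table.
    static String partitionDecorator(String table, LocalDate date) {
        return table + "$" + date.format(FMT);
    }

    public static void main(String[] args) {
        LocalDate day = LocalDate.of(2016, 9, 30);
        System.out.println(shardName("events", day));          // events_20160930
        System.out.println(partitionDecorator("events", day)); // events$20160930
    }
}
```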
The approach I took (it works in streaming mode, too):
Convert the window into the table/partition name.
Set the window based on the incoming data; the end instant can be ignored, as the start value is used for setting the partition.
Set the table partition dynamically.
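The original code samples are not preserved here, but the first step can be sketched in plain Java (the table name is hypothetical; in a real pipeline the epoch-millis value would come from the window's start instant):

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class WindowPartition {
    private static final DateTimeFormatter DAY =
        DateTimeFormatter.ofPattern("yyyyMMdd").withZone(ZoneOffset.UTC);

    // Maps a window's start timestamp (epoch millis) to a day-partition decorator.
    // The window's end instant is ignored, as described above: only the start
    // value determines which partition the elements land in.
    static String tableForWindowStart(String table, long startMillis) {
        return table + "$" + DAY.format(Instant.ofEpochMilli(startMillis));
    }

    public static void main(String[] args) {
        // 1475193600000L is 2016-09-30T00:00:00Z
        System.out.println(tableForWindowStart("mydataset.events", 1475193600000L));
    }
}
```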
Is there a better way of achieving the same outcome?
Apache Beam version 2.0 supports sharding BigQuery output tables out of the box.
As Pavan says, it is definitely possible to write to partition tables with Dataflow. Are you using the
DataflowPipelineRunner
operating in streaming mode or batch mode?

The solution you proposed should work. Specifically, if you pre-create a table with date partitioning set up, then you can use a
BigQueryIO.Write.toTableReference
lambda to write to a date partition. For example:
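The original example code is not preserved here, but the idea can be sketched in plain Java (project, dataset, and table names are hypothetical; the Dataflow-specific wiring is shown only in a comment):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class PartitionTarget {
    // Builds a "project:dataset.table$YYYYMMDD" table spec. In SDK 1.x, a string
    // like this would be the value produced by the lambda passed to
    // BigQueryIO.Write (e.g. via toTableReference), so that each write targets
    // the day partition of the pre-created partitioned table.
    static String partitionSpec(String project, String dataset, String table, LocalDate day) {
        return project + ":" + dataset + "." + table
            + "$" + day.format(DateTimeFormatter.ofPattern("yyyyMMdd"));
    }

    public static void main(String[] args) {
        System.out.println(partitionSpec("my-project", "logs", "events", LocalDate.of(2017, 1, 2)));
    }
}
```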