Suppose I have a PCollection<Foo>
and I want to write it to multiple BigQuery tables, choosing a potentially different table for each Foo
.
How can I do this using the Apache Beam BigQueryIO
API?
Suppose I have a PCollection<Foo>
and I want to write it to multiple BigQuery tables, choosing a potentially different table for each Foo
.
How can I do this using the Apache Beam BigQueryIO
API?
This is possible using a feature recently added to
BigQueryIO
in Apache Beam.Depending on whether the input
PCollection<Foo>
is bounded or unbounded, under the hood this will either create multiple BigQuery import jobs (one or more per table depending on amount of data), or it will use the BigQuery streaming inserts API.The most flexible version of the API uses
DynamicDestinations
, which allows you to write different values to different tables with different schemas, and even allows you to use side inputs from the rest of the pipeline in all of these computations.Additionally, BigQueryIO has been refactored into a number of reusable transforms that you can yourself combine to implement more complex use cases - see files in the source directory.
This feature will be included in the first stable release of Apache Beam and into the next release of Dataflow SDK (which will be based on the first stable release of Apache Beam). Right now you can use this by running your pipeline against a snapshot of Beam at HEAD from github.