I am trying to stream some data from google PubSub into BigQuery using a python dataflow. For testing purposes I have adapted the following code https://github.com/GoogleCloudPlatform/DataflowSDK-examples/blob/master/python/dataflow_examples/cookbook/bigquery_schema.py into a streaming pipeline by setting
options.view_as(StandardOptions).streaming = True
So then I changed the record_ids pipeline to read from Pub/Sub
# ADDED THIS
lines = p | 'Read PubSub' >> beam.io.ReadStringsFromPubSub(INPUT_TOPIC) | beam.WindowInto(window.FixedWindows(15))
# CHANGED THIS # record_ids = p | 'CreateIDs' >> beam.Create(['1', '2', '3', '4', '5'])
record_ids = lines | 'Split' >> (beam.FlatMap(split_fn).with_output_types(unicode))
records = record_ids | 'CreateRecords' >> beam.Map(create_random_record)
records | 'Write' >> beam.io.Write(
beam.io.BigQuerySink(
OUTPUT,
schema=table_schema,
create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE))
Note: I have been whitelisted by google to run the code (in alpha)
Now when I try it I have an error
Workflow failed. Causes: (f215df7c8fcdbb00): Unknown streaming sink: bigquery
You can find the full code here: https://github.com/marcorigodanzo/gcp_streaming_test/blob/master/my_bigquery_schema.py
I think that this has to do with the pipeline being now of type streaming, can anyone please tell me how to do a bigQuery write in a streaming pipeline?
Beam Python does not support writing to BigQuery from streaming pipelines. For now, you will need to use Beam Java - you can use respectively
PubsubIO.readStrings()
andBigQueryIO.writeTableRows()
.