I have a PCollection P1 that contains a field of IDs. I want to take the complete ID column from the PCollection as a list and pass it to a BigQuery query to filter one BigQuery table.
What would be the fastest and most optimized way to do this?
I'm new to Dataflow and big data. Can anyone give some hints on this?
Thanks!
From what I understood of your question, you want to build the SQL statement from the IDs you have in P1. One way to achieve this is to collect the IDs into a single list and format them into the query string.
The
beam.combiners.ToList()
transform collapses your whole PCollection into a single list (which I later inject into the SQL placeholder). You can now take the SQL in the file
results.csv-00000-to-000001
and run that query against BQ. I'm not sure whether it's possible to run the query directly from the PCollection, though (something like
p | all transformations | beam.io.Write(beam.io.BigQuerySink(result_sql))
). Reading the SQL from the end result file and then issuing the query against BQ seems the best approach here.
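For that last step, here is a hedged sketch using the google-cloud-bigquery client library (my assumption, not part of the pipeline above); it would need valid GCP credentials to actually run, and the shard file name follows Beam's default <prefix>-00000-of-00001 pattern:

```python
def run_generated_sql(path="results.csv-00000-of-00001"):
    """Read the SQL produced by the pipeline and run it against BigQuery."""
    # Imported inside the function so this sketch stays importable even
    # without the google-cloud-bigquery package installed.
    from google.cloud import bigquery

    with open(path) as f:
        sql = f.read().strip()
    client = bigquery.Client()
    # query() returns a job; result() blocks until the rows are available.
    return list(client.query(sql).result())
```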