How I can convert PCollection to a list in python

2020-03-06 12:05发布

I have a PCollection P1 that contains a field of ID's . I want to take the complete ID's column from the PCollection as a list and pass this value to a BigQuery query for filtering one BigQuery table.

What would be the fastest and most optimized way for doing this?

I'm new to Dataflow and BigData. Can any one give some hints on this?

Thanks!

1条回答
叛逆
2楼-- · 2020-03-06 12:25

For what I understood from your question you want to build the SQL statement given the IDs you have in P1. This is one example of how you can achieve this:

sql = """select ID from `table` WHERE ID IN ({})"""
with beam.Pipeline(options=StandardOptions()) as p:
         (p | 'Create' >> beam.Create(['1', '2', '3']) 
            | 'Combine' >> beam.combiners.ToList()
            | 'Build SQL' >> beam.Map(lambda x: sql.format(','.join(map(lambda x: '"' + x + '"', x))))
            | 'Save' >> beam.io.WriteToText('results.csv'))

Results:

select ID from `table` WHERE ID IN ("1","2","3")

The operation beam.combiners.ToList() transforms your whole PCollection data into a single list (which I used later on to inject in the SQL placeholder).

You can now use the SQL in the file results.csv-00000-to-000001 to run this query against BQ.

I'm not sure if it's possible to run this query directly in the PCollection though (something like (p | all transformations | beam.io.Write(beam.io.BigQuerySink(result sql)) ). I suppose reading from the end result file and then issuing the query against BQ would be the best approach here.

查看更多
登录 后发表回答