I'm trying to extract data from 2 tables in BigQuery and then join them with CoGroupByKey. However, the output of BigQuery is PCollection<TableRow>, while CoGroupByKey requires PCollection<KV<K, V>>. How can I convert a PCollection<TableRow> to a PCollection<KV<K, V>>?
CoGroupByKey needs to know which key to CoGroup by - this is the K in KV<K, V>, and the V is the value associated with this key in this collection. The result of co-grouping several collections will give you, for each key, all of the values with this key in each collection.

So, you need to convert both of your PCollection<TableRow> to PCollection<KV<YourKey, TableRow>>, where YourKey is the type of key on which you want to join them - in your case it might be String, or Integer, or something else.

The best transform to do the conversion is probably WithKeys. E.g. here's a code sample converting a PCollection<TableRow> to a PCollection<KV<String, TableRow>> keyed by a hypothetical userId field of type String:
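(A minimal sketch, assuming the Apache Beam Java SDK; the table names, step names, and the userId column are hypothetical placeholders, and the same idea works for any other key field or type.)

    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.WithKeys;
    import org.apache.beam.sdk.transforms.join.CoGbkResult;
    import org.apache.beam.sdk.transforms.join.CoGroupByKey;
    import org.apache.beam.sdk.transforms.join.KeyedPCollectionTuple;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.TupleTag;
    import org.apache.beam.sdk.values.TypeDescriptors;

    public class JoinTables {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Read both tables (table references are placeholders).
        PCollection<TableRow> table1 =
            p.apply("ReadTable1", BigQueryIO.readTableRows().from("project:dataset.table1"));
        PCollection<TableRow> table2 =
            p.apply("ReadTable2", BigQueryIO.readTableRows().from("project:dataset.table2"));

        // Key each TableRow by the hypothetical "userId" field. withKeyType is
        // needed so Beam can infer a coder for the key when a lambda is used.
        PCollection<KV<String, TableRow>> keyed1 =
            table1.apply("KeyTable1",
                WithKeys.of((TableRow row) -> (String) row.get("userId"))
                    .withKeyType(TypeDescriptors.strings()));
        PCollection<KV<String, TableRow>> keyed2 =
            table2.apply("KeyTable2",
                WithKeys.of((TableRow row) -> (String) row.get("userId"))
                    .withKeyType(TypeDescriptors.strings()));

        // Co-group the two keyed collections on userId: for each key, the
        // CoGbkResult holds all matching rows from each input, per tag.
        TupleTag<TableRow> tag1 = new TupleTag<>();
        TupleTag<TableRow> tag2 = new TupleTag<>();
        PCollection<KV<String, CoGbkResult>> joined =
            KeyedPCollectionTuple.of(tag1, keyed1)
                .and(tag2, keyed2)
                .apply(CoGroupByKey.create());

        p.run();
      }
    }

Downstream, a ParDo over the joined collection can pull each side's rows out of the CoGbkResult with result.getAll(tag1) and result.getAll(tag2).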