Convert from PCollection to PCollection<

2019-09-15 17:59发布

问题:

I'm trying to extract data from 2 tables in BigQuery, then join it by CoGroupByKey. Although the output of BigQuery is PCollection<TableRow>, CoGroupByKey requires PCollection<KV<K,V>>. How can I convert from PCollection<TableRow> to PCollection<KV<K,V>>?

回答1:

CoGroupByKey needs to know which key to CoGroup by - this is the K in KV<K, V>, and the V is the value associated with this key in this collection. The result of co-grouping several collections will give you, for each key, all of the values with this key in each collection.

So, you need to convert both of your PCollection<TableRow> to PCollection<KV<YourKey, TableRow>> where YourKey is the type of key on which you want to join them, e.g. in your case perhaps it might be String, or Integer, or something else.

The best transform to do the conversion is probably WithKeys. E.g. here's a code sample converting a PCollection<TableRow> to a PCollection<KV<String, TableRow>> keyed by a hypothetical userId field of type String:

PCollection<TableRow> rows = ...;
PCollection<KV<String, TableRow>> rowsKeyedByUser = rows
    .apply(WithKeys.of(new SerializableFunction<TableRow, String>() {
  @Override
  public String apply(TableRow row) {
    return (String)row.get("userId");
  }
}));