Convert from PCollection to PCollection<

2019-09-15 17:37发布

I'm trying to extract data from 2 tables in BigQuery, then join it by CoGroupByKey. Although the output of BigQuery is PCollection<TableRow>, CoGroupByKey requires PCollection<KV<K,V>>. How can I convert from PCollection<TableRow> to PCollection<KV<K,V>>?

1条回答
SAY GOODBYE
2楼-- · 2019-09-15 18:09

CoGroupByKey needs to know which key to CoGroup by - this is the K in KV<K, V>, and the V is the value associated with this key in this collection. The result of co-grouping several collections will give you, for each key, all of the values with this key in each collection.

So, you need to convert both of your PCollection<TableRow> to PCollection<KV<YourKey, TableRow>> where YourKey is the type of key on which you want to join them, e.g. in your case perhaps it might be String, or Integer, or something else.

The best transform to do the conversion is probably WithKeys. E.g. here's a code sample converting a PCollection<TableRow> to a PCollection<KV<String, TableRow>> keyed by a hypothetical userId field of type String:

PCollection<TableRow> rows = ...;
PCollection<KV<String, TableRow>> rowsKeyedByUser = rows
    .apply(WithKeys.of(new SerializableFunction<TableRow, String>() {
  @Override
  public String apply(TableRow row) {
    return (String)row.get("userId");
  }
}));
查看更多
登录 后发表回答