I'm trying to extract data from 2 tables in BigQuery, then join it by CoGroupByKey.
Although the output of BigQuery is PCollection<TableRow>
, CoGroupByKey
requires PCollection<KV<K,V>>
.
How can I convert from PCollection<TableRow>
to PCollection<KV<K,V>>
?
可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效,请关闭广告屏蔽插件后再试):
问题:
回答1:
CoGroupByKey
needs to know which key to CoGroup
by - this is the K
in KV<K, V>
, and the V
is the value associated with this key in this collection. The result of co-grouping several collections will give you, for each key, all of the values with this key in each collection.
So, you need to convert both of your PCollection<TableRow>
to PCollection<KV<YourKey, TableRow>>
where YourKey
is the type of key on which you want to join them, e.g. in your case perhaps it might be String
, or Integer
, or something else.
The best transform to do the conversion is probably WithKeys
. E.g. here's a code sample converting a PCollection<TableRow>
to a PCollection<KV<String, TableRow>>
keyed by a hypothetical userId
field of type String
:
PCollection<TableRow> rows = ...;
PCollection<KV<String, TableRow>> rowsKeyedByUser = rows
.apply(WithKeys.of(new SerializableFunction<TableRow, String>() {
@Override
public String apply(TableRow row) {
return (String)row.get("userId");
}
}));