Multiple CoGroupByKey with same key apache beam

2019-07-28 22:17发布

I have a situation where I need to join the main data stream (1.5TB) in my pipeline to 2 different datasets (4.92GB and 17.35GB). The key that I use to do the CoGroupByKey for both are the same. Is there a way to avoid reshuffling the left side of the join after the first completes? Currently I am just leaving the output as a KV>. This seems to be better than emitting each element piecewise after the first join, but the second groupByKey still seems to be taking a lot longer than I would expect. I was going to start looking into pulling apart CoGroupByKey to see if I can ignore grouping one side, but I really feel safer not going down to that level at this point.

This was prior to keeping Iterables grouped after the first join

标签： google-cloud-dataflow dataflow apache-beam

1条回答

我命由我不由天

2楼-- · 2019-07-28 22:59

Have you considered accessing the smaller datasets as View.asMap() or View.asMultimap() side inputs when processing the main input? The Dataflow runner has an optimized implementation of map and multimap side inputs which performs key lookups efficiently without loading the whole data into memory.

0人赞添加讨论(0) 举报

Multiple CoGroupByKey with same key apache beam

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间