Multiple CoGroupByKey with same key apache beam

2019-07-28 22:25发布

站内文章 / 移动开发

28 0

祖国的老花朵

女 | 书童

私信

可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

I have a situation where I need to join the main data stream (1.5TB) in my pipeline to 2 different datasets (4.92GB and 17.35GB). The key that I use to do the CoGroupByKey for both are the same. Is there a way to avoid reshuffling the left side of the join after the first completes? Currently I am just leaving the output as a KV>. This seems to be better than emitting each element piecewise after the first join, but the second groupByKey still seems to be taking a lot longer than I would expect. I was going to start looking into pulling apart CoGroupByKey to see if I can ignore grouping one side, but I really feel safer not going down to that level at this point.

This was prior to keeping Iterables grouped after the first join

回答1:

Have you considered accessing the smaller datasets as View.asMap() or View.asMultimap() side inputs when processing the main input? The Dataflow runner has an optimized implementation of map and multimap side inputs which performs key lookups efficiently without loading the whole data into memory.