Right way to handle one-to-many stages in Dataflow

Posted 2019-07-25 17:01

I have a (Java) batch pipeline that follows this pattern:

(FileIO)
(ExtractText > input=1 file, output=millions of lines of text)
(ProcessData)

The ProcessData stage contains slow parts (matching data against big whitelists) and needs to be scaled across several workers, which should not be an issue since it only contains DoFns. However, my one-to-many stage seems to force all of its outputs to be processed by a single worker: when I instantiate more workers, all but one stay idle, or they are scaled down if autoscaling is enabled.

Based on other Stack Overflow entries, I have tried shuffling via Reshuffle.viaRandomKey(). This does not work because Reshuffle contains a GroupByKey which loads all the results in memory, causing an OOM, even if I window the data beforehand via Window.<String>into(FixedWindows.of(Duration.standardSeconds(1))).
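For illustration only, here is a minimal plain-Java sketch (not Beam code) of the random-key idea behind that attempt: every output of the one-to-many step is paired with a random key and then grouped by it, so the elements end up spread over several buckets. The class name RandomKeySketch, the helper names, the bucket count and the seed are all hypothetical, and the sketch assumes everything fits in memory, which is exactly what does not hold in the pipeline above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

public class RandomKeySketch {

    // Hypothetical one-to-many step: a single "file" expands into many lines.
    static List<String> extractText(String fileName, int lineCount) {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < lineCount; i++) {
            lines.add(fileName + ":" + i);
        }
        return lines;
    }

    // Pair each element with a random key in [0, buckets) and group by that key.
    // Assumption: this only mimics the keying-and-grouping idea in memory;
    // it is not how a distributed runner actually executes a shuffle.
    static Map<Integer, List<String>> groupByRandomKey(
            List<String> elements, int buckets, long seed) {
        Random random = new Random(seed);
        Map<Integer, List<String>> grouped = new HashMap<>();
        for (String element : elements) {
            int key = random.nextInt(buckets);
            grouped.computeIfAbsent(key, k -> new ArrayList<>()).add(element);
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<String> lines = extractText("input.txt", 1000);
        Map<Integer, List<String>> grouped = groupByRandomKey(lines, 4, 42L);
        // Print how many lines landed in each bucket.
        grouped.forEach((key, values) ->
                System.out.println("bucket " + key + " -> " + values.size() + " lines"));
    }
}
```

Each bucket here holds its whole group as a list in memory, which is the same kind of materialization that the grouping step is suspected of causing in the real pipeline.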

Another option would be to create a CustomSource to replace the first two stages, but I find this method inadequate because 1) the documentation of custom sources is severely lacking, 2) it takes more time and code to implement, and 3) this one-to-many issue could well be encountered in the middle of a pipeline, where I couldn't create custom sources.

How should I handle one-to-many stages in a Dataflow pipeline?

Tags: java performance one-to-many google-cloud-dataflow apache-beam

0 answers
