I have a (Java) batch pipeline that follows this pattern:
(FileIO)
(ExtractText > input=1 file, output=millions of lines of text)
(ProcessData)
The ProcessData stage contains slow parts (matching data against big whitelists) and needs to be scaled across several workers, which should not be an issue since it only contains DoFns. However, it seems that my one-to-many stage forces all of its outputs to be processed by a single worker: instantiating more workers leaves all but one of them idle (or gets them downscaled if autoscaling is enabled).
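For reference, the pipeline roughly looks like this (an untested sketch; `ExtractTextFn` and `ProcessDataFn` are hypothetical stand-ins for my actual DoFns):

```java
Pipeline p = Pipeline.create(options);
p.apply(FileIO.match().filepattern("gs://my-bucket/input-file"))
 .apply(FileIO.readMatches())
 // one-to-many: 1 file in, millions of lines out
 .apply("ExtractText", ParDo.of(new ExtractTextFn()))
 // slow: matches each line against big whitelists
 .apply("ProcessData", ParDo.of(new ProcessDataFn()));
p.run();
```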
Based on other Stack Overflow entries, I have tried shuffling via `Reshuffle.viaRandomKey()`. This does not work because `Reshuffle` contains a `GroupByKey`, which loads all the results into memory and causes an OOM, even if I window the collection beforehand via `Window.<String>into(FixedWindows.of(Duration.standardSeconds(1)))`.
Another option would be to create a custom Source to replace the first two stages, but I find this approach inadequate because: 1) the documentation on custom sources is severely lacking; 2) it takes more time and code to implement; and 3) the same one-to-many issue could arise in the middle of a pipeline, where I couldn't create a custom source.
How should I handle one-to-many (fan-out) stages in a Dataflow pipeline?