I am using the Python SDK for Apache Beam to run a feature extraction pipeline on Google Cloud Dataflow. I need to run multiple transformations, all of which expect items to be grouped by key.
Based on the answer to this question, Dataflow is unable to automatically spot and reuse repeated transformations such as GroupBy, so I hoped to run GroupBy first and then feed the resulting PCollection to the other transformations (see the sample code below).
I wonder whether this is supposed to work efficiently in Dataflow. If not, what is the recommended workaround in the Python SDK? Is there an efficient way to have multiple Map or Write transformations consume the results of the same GroupBy? In my case, I observe Dataflow scale to the maximum number of workers at 5% utilization and make no progress at the steps following the GroupBy, as described in this question.
Sample code. For simplicity, only 2 transformations are shown.
import apache_beam as beam

# Group by key once.
items_by_key = raw_items | beam.GroupByKey()

# Write grouped items to a file.
(items_by_key | beam.FlatMap(format_item) | beam.io.WriteToText(path))

# Run another transformation over the same grouped collection.
features = (items_by_key | beam.Map(extract_features))
Feeding the output of a single GroupByKey step into multiple transforms should work fine. But the amount of parallelization you can get depends on the total number of keys available in the original GroupByKey step. If any of the downstream steps has a high fan-out, consider adding a Reshuffle step after those steps, which will allow Dataflow to further parallelize execution.
For example,
(pipeline
 | Create([<list of globs>])
 | ParDo(ExpandGlobDoFn())
 | Reshuffle()
 | ParDo(MyReadDoFn())
 | Reshuffle()
 | ParDo(MyProcessDoFn()))
Here,
ExpandGlobDoFn: expands input globs and generates files
MyReadDoFn: reads a given file
MyProcessDoFn: processes an element read from a file
I used two Reshuffles here (note that Reshuffle has a GroupByKey in it) to allow (1) parallelizing reading of files from a given glob and (2) parallelizing processing of elements from a given file.
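For completeness, a minimal sketch of what such DoFns could look like (the glob expansion and file reading below are illustrative assumptions, not production code; transform_element is a hypothetical helper):

import apache_beam as beam
from apache_beam.io.filesystems import FileSystems

class ExpandGlobDoFn(beam.DoFn):
    """Expands one glob (e.g. 'gs://bucket/input/*.txt') into matching file paths."""
    def process(self, glob):
        for metadata in FileSystems.match([glob])[0].metadata_list:
            yield metadata.path

class MyReadDoFn(beam.DoFn):
    """Reads one file and emits its lines (whole-file read, fine for a sketch)."""
    def process(self, path):
        f = FileSystems.open(path)
        try:
            for line in f.read().decode('utf-8').splitlines():
                yield line
        finally:
            f.close()

class MyProcessDoFn(beam.DoFn):
    """Processes a single element read from a file."""
    def process(self, element):
        yield transform_element(element)  # transform_element is a hypothetical helper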
Based on my experience troubleshooting this SO question, reusing GroupBy output in more than one transformation can make your pipeline extremely slow. At least that was my experience with the Apache Beam SDK 2.11.0 for Python.
Common sense told me that branching out from a single GroupBy in the execution graph should make my pipeline run faster. After 23 hours of running on 120+ workers, the pipeline had not made any significant progress. I tried adding reshuffles, using a combiner where possible, and disabling the experimental shuffle service.
Nothing helped until I split the pipeline in two. The first pipeline computes the GroupBy and stores its output in a file (I need to ingest it "as is" into the DB). The second pipeline reads the file with the GroupBy output, reads additional inputs, and runs further transformations. The result: all transformations finished successfully in under 2 hours. I think that if I had simply duplicated the GroupBy in my original pipeline, I would probably have achieved the same result.
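Roughly, the two-pipeline workaround looks like this (a sketch, not my exact code; to_key_value, the JSON serialization, and the path and option names are placeholders):

import json
import apache_beam as beam

# Pipeline 1: group once and persist the grouped output (serialization format is an assumption).
with beam.Pipeline(options=options) as p:
    grouped = (p
               | 'Read' >> beam.io.ReadFromText(raw_items_path)  # placeholder source
               | 'KV' >> beam.Map(to_key_value)                  # hypothetical: returns (key, value)
               | 'Group' >> beam.GroupByKey())
    (grouped
     | 'Serialize' >> beam.Map(lambda kv: json.dumps({'key': kv[0], 'items': list(kv[1])}))
     | 'Write' >> beam.io.WriteToText(grouped_path))

# Pipeline 2: read the persisted groups back and run the remaining transformations.
with beam.Pipeline(options=options) as p:
    features = (p
                | 'ReadGroups' >> beam.io.ReadFromText(grouped_path + '*')
                | 'Parse' >> beam.Map(json.loads)
                | 'Extract' >> beam.Map(extract_features))  # operates on the deserialized record here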
I wonder whether this is a bug in the Dataflow execution engine or in the Python SDK, or whether it works as intended. If it is by design, it should at least be documented, and a pipeline like this should either not be accepted when submitted or should produce a warning.
You can spot this issue by looking at the two branches coming out of the "Group keywords" step. It looks like the solution is to rerun GroupBy for each branch separately.
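Staying within a single pipeline, duplicating the GroupBy per branch would look roughly like this, based on the sample code from the question (step labels are my own):

import apache_beam as beam

# Each branch gets its own GroupByKey, so Dataflow can execute the branches independently.
(raw_items
 | 'GroupForWrite' >> beam.GroupByKey()
 | 'Format' >> beam.FlatMap(format_item)
 | 'Write' >> beam.io.WriteToText(path))

features = (raw_items
            | 'GroupForFeatures' >> beam.GroupByKey()
            | 'ExtractFeatures' >> beam.Map(extract_features))

The explicit labels ('GroupForWrite', 'GroupForFeatures') are needed because applying two transforms with the same default name to the same pipeline raises a duplicate-label error.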