Iterative processing in Dataflow

2019-08-15 07:33发布

问题:

As shown here Dataflow pipelines are represented by a fixed DAG. I'm wondering if it's possible to implement a pipeline where the processing proceeds until a dynamically evaluated condition is satisfied based on the data computed so far.

Here's some pseudo code to illustrate what I'd like to implement:

    PCollection pco = null
    while(true):
        pco = pco.apply(someTransform())
        if (conditionSatisfied(pco)):
            break
    pco.Write()

回答1:

It seems like you really want iterative computations. Right now Dataflow does not provide support for that, but we are aware that it is a very important use case and we are working on finding the right set of APIs to express it.

For now your workarounds are:

  • Iteratively run whole pipelines (run pipeline, inspect output, run again if the condition is not satisfied, etc). This has the obvious downside of pipeline setup and teardown overhead.
  • Build a pipeline with a hard-coded number of iterations by .apply()'ing in a loop unconditionally, then run the whole pipeline.
  • A combination of the two, e.g. run fixed 5-iteration pipelines until you're satisfied with the result.