When does Dataflow acknowledge a message of batche

2020-02-12 07:07发布

问题:

There has been a question on this topic, the answer said "The acknowledgement will be made once the message is durable persisted somewhere in the Dataflow pipeline.".

Conceptually, that makes sense, but I am not sure how Dataflow is capable of tracking a message after it has been deserialized and transformed in the pipeline before its payload is persisted.

In our case, the PubSub message contains a batch of items. After the message is received and deserialized, we broken down the batch for processing. Eventually, an item in the batch could be either discarded or committed to Datastore depending on its timestamp.

How does the acknowledgement work in this situation?

回答1:

Dataflow executes your code in bundles. After successful execution each bundle is committed to avoid re-execution on successfully processed elements. Bundles are not necessarily committed between every step in the pipeline. See the description of fusion optimization for details about when PCollections are materialized and committed.

For PubSub, messages that were read as part of a bundle will be acknowledged as part of committing the completion of that bundle. This means if you look at the PubSub read step, and any ParDos after it, these will be executed (and committed) together.

Adding a GroupByKey after the PubSub read allows messages to be acknowledged to PubSub as soon as the bundles are committed to the GroupByKey.