Processing with State and Timers

Posted 2019-07-25 07:37

Question:

Are there any guidelines or limitations for using stateful processing and timers with the Beam Dataflow runner (as of v2.1.0)? For example, are there limits on the size of state or the frequency of updates? The candidate streaming pipeline would use state and timers extensively for user session state, with Bigtable as durable storage.

Answer 1:

Here is some general advice for your use case:

  • Aggregate multiple elements before setting a timer.
  • Don't create a timer per element; that would be excessive.
  • Try to aggregate state instead of accumulating a large amount of it. For example, to compute a mean, keep a running sum and count rather than storing every number.
  • Consider session windows for this use case.
  • On the Dataflow runner, state is not supported for merging windows, although the Beam model itself supports it.
  • Choose the state type that matches your access pattern, e.g. BagState for blind writes (appends that never read the existing contents first).
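The first three points above can be illustrated with a minimal sketch. This is a plain-Python simulation, not the Beam API: the names MeanAccumulator, SessionAggregator, flush_delay and timers_set are hypothetical and exist only for illustration. In a real pipeline the sum and count would live in Beam state cells and the deadline would be a Beam timer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MeanAccumulator:
    """Constant-size state: a running sum and count instead of every value."""
    total: float = 0.0
    count: int = 0

    def add(self, value: float) -> None:
        self.total += value
        self.count += 1

    def mean(self) -> Optional[float]:
        # No elements seen yet means there is no mean to report.
        return self.total / self.count if self.count else None


class SessionAggregator:
    """Hypothetical per-key handler (illustration only, not a Beam class).

    Aggregates incoming elements and keeps at most one pending timer,
    rather than setting a new timer for every element.
    """

    def __init__(self, flush_delay: float) -> None:
        self.flush_delay = flush_delay
        self.acc = MeanAccumulator()
        self.timer_deadline: Optional[float] = None
        self.timers_set = 0  # counts how many timers were actually set

    def process(self, value: float, now: float) -> None:
        self.acc.add(value)
        # Set a timer only if none is pending for this batch of elements.
        if self.timer_deadline is None:
            self.timer_deadline = now + self.flush_delay
            self.timers_set += 1

    def on_timer(self) -> Optional[float]:
        # Emit the aggregate, then clear the state and the pending timer.
        result = self.acc.mean()
        self.acc = MeanAccumulator()
        self.timer_deadline = None
        return result
```

Feeding three elements into one SessionAggregator sets a single timer, and the state stays at two numbers no matter how many elements arrive before the timer fires.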

Here is an informative blog post with more information on state: "Stateful processing with Apache Beam."