Are there any guidelines or limitations for using stateful processing and timers with the Beam Dataflow runner (as of v2.1.0)? Things such as limitations on the size of state or frequency of updates etc.? The candidate streaming pipeline would use state and timers extensively for user session state, with Bigtable as durable storage.
可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效,请关闭广告屏蔽插件后再试):
问题:
回答1:
Here is some general advice for your use case
- Please aggregate multiple elements then set a timer.
- Please don't create a timer per element, which would be excessive.
- Try and aggregate state, instead of accumulating large amount of state. I.e. aggregate as a sum and count, instead of storing every number when trying to compute a mean.
- Please consider session windows for this use case.
- In dataflow, state is not supported for merging windows. It is for beam.
- Please use state according to your access pattern, i.e. BagState for blind writes.
Here is an informative blog post with some more info on state "Stateful processing with Apache Beam."