I have a PCollection in GCP Dataflow/Apache Beam. Instead of processing it one by one, I need to combine "by N". Something like grouped(N)
. So, in case of bounded processing, it will group by 10 items in batch and last batch with whatever left.
Is this possible in Apache Beam?
可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效,请关闭广告屏蔽插件后再试):
问题:
回答1:
Edit, looks like: Google Dataflow "elementCountExact" aggregation
You should be able to do something similar by assigning elements to global window and using AfterPane.elementCountAtLeast(N)
. You still need to account for what what if there isn’t enough elements to fire the trigger. You could use this:
Repeatedly.forever(AfterFirst.of(
AfterPane.elementCountAtLeast(N),
AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardMinutes(X))))
But you should ask yourself why do you need this heuristic in the first place, there probably is more idomatice way to solve your problem. Read about Data-Driven Triggers
in Beam’s programming guide