Batch PCollection in Beam/Dataflow

Posted 2019-03-04 10:49

Question:

I have a PCollection in GCP Dataflow/Apache Beam. Instead of processing its elements one by one, I need to combine them "by N", something like grouped(N). So, for bounded processing, it would group the elements into batches of 10, with the last batch holding whatever is left. Is this possible in Apache Beam?

Answer 1:

Edit: this looks like the same question as: Google Dataflow "elementCountExact" aggregation

You should be able to do something similar by assigning the elements to the global window and using AfterPane.elementCountAtLeast(N). You still need to account for the case where there aren't enough elements to fire the trigger. You could use this:

 // N = element count that fires a pane, X = maximum wait in minutes (placeholders)
 Repeatedly.forever(AfterFirst.of(
     AfterPane.elementCountAtLeast(N),
     AfterProcessingTime.pastFirstElementInPane()
         .plusDelayOf(Duration.standardMinutes(X))))

But you should ask yourself why you need this heuristic in the first place; there is probably a more idiomatic way to solve your problem. Read about Data-Driven Triggers in Beam's programming guide.
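To make the target behaviour concrete, here is a minimal plain-Java sketch of the "batches of N, remainder last" semantics the question asks for. It deliberately uses no Beam API: the class name BatchSketch and the method batch are hypothetical names made up for this illustration, and it works on an in-memory List rather than a PCollection.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Apache Beam.
class BatchSketch {
    // Splits items into consecutive batches of at most n elements.
    // The final batch holds whatever is left over.
    static <T> List<List<T>> batch(List<T> items, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += n) {
            int end = Math.min(start + n, items.size());
            // Copy the slice so each batch is independent of the input list.
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> items = new ArrayList<>();
        for (int i = 1; i <= 25; i++) {
            items.add(i);
        }
        // 25 elements with n = 10 gives batches of size 10, 10 and 5.
        for (List<Integer> b : batch(items, 10)) {
            System.out.println(b.size() + " elements: " + b);
        }
    }
}
```

Keep in mind that a PCollection is unordered and distributed, so Beam does not promise which elements end up together, and elementCountAtLeast(N) only guarantees at least N elements per pane, not exactly N. For keyed collections, the Java SDK's GroupIntoBatches transform may also be worth a look.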