I have a PCollection in GCP Dataflow/Apache Beam. Instead of processing its elements one by one, I need to combine them into groups of N, something like grouped(N). So, in the case of bounded processing, it would group items into batches of 10, with the last batch holding whatever is left over.
Is this possible in Apache Beam?
Edit: this looks related: Google Dataflow "elementCountExact" aggregation
You should be able to do something similar by assigning elements to the global window and using AfterPane.elementCountAtLeast(N). You still need to account for the case where there aren't enough remaining elements to fire the trigger; a composite trigger with a fallback can cover that (see the sketch below). But you should ask yourself why you need this heuristic in the first place; there is probably a more idiomatic way to solve your problem. Read about Data-Driven Triggers in Beam's programming guide.
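For reference, here is a rough Java sketch of that trigger-based approach, under a few assumptions of my own: a bounded PCollection of Integers, a batch size of 10, a dummy constant key so GroupByKey can collect the fired panes, and a one-minute processing-time fallback trigger. The class name and the fallback delay are illustrative choices, not something from the answer above.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.WithKeys;
import org.apache.beam.sdk.transforms.windowing.AfterFirst;
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class BatchByN {
  public static void main(String[] args) {
    int batchSize = 10; // the "N" in grouped(N)
    Pipeline p = Pipeline.create();

    PCollection<Integer> input =
        p.apply(Create.of(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12));

    PCollection<KV<Integer, Iterable<Integer>>> batches = input
        // Put everything under one dummy key so GroupByKey can collect batches.
        .apply(WithKeys.of(0))
        // Global window, re-firing whenever at least N elements are buffered.
        // The processing-time trigger is a fallback so a partial batch is
        // eventually emitted in streaming; on bounded input the final pane
        // fires anyway once the input is exhausted.
        .apply(Window.<KV<Integer, Integer>>into(new GlobalWindows())
            .triggering(Repeatedly.forever(AfterFirst.of(
                AfterPane.elementCountAtLeast(batchSize),
                AfterProcessingTime.pastFirstElementInPane()
                    .plusDelayOf(Duration.standardMinutes(1)))))
            .discardingFiredPanes()
            .withAllowedLateness(Duration.ZERO))
        // Each fired pane becomes one batch of roughly N elements.
        .apply(GroupByKey.<Integer, Integer>create());

    p.run().waitUntilFinish();
  }
}
```

Note that elementCountAtLeast only guarantees a lower bound, so a runner may fire panes with more than N elements, and batches are formed per key and per pane, not globally. If a keyed batching transform is acceptable, Beam also ships GroupIntoBatches.ofSize(N), which may be a more direct fit than hand-rolled triggers.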