How to create custom Combine.PerKey in beam sdk 2.

2019-07-10 05:17发布

We figured out how to create a custom combine function (after lots of guesswork and beam sdk 2.0 code reading) in beam sdk 2.0, as the dataflow sdk 1.x syntax did not work in sdk 2.0.

However, we can't figure out how to create a custom combine PER KEY function in beam sdk 2.0. Any help or pointers (or better yet an actual example) would be greatly appreciated. (We scoured the internet for documentation or examples and found none; we also attempted to look at the code within beam sdk 2.0's Combine class, but couldn't figure it out, especially since the PerKey class now has a private constructor, so we can't extend it any longer.)

In case it helps, here's how we correctly created a custom combiner (without) keys in beam sdk 2.0, but we can't figure out how to create one with a key:

public class CombineTemplateIntervalsIntoBlocks
        extends Combine.AccumulatingCombineFn<ImmutableMySetOfIntervals, TemplateIntervalAccum, ArrayList<ImmutableMySetOfIntervals>>{


    public CombineTemplateIntervalsIntoBlocks() {
    }

    @Override
    public TemplateIntervalAccum createAccumulator() {
        return new TemplateIntervalAccum()
    }

and then

public class TemplateIntervalAccum
        implements Combine.AccumulatingCombineFn.Accumulator<ImmutableMySetOfIntervals, TemplateIntervalAccum, ArrayList<ImmutableMySetOfIntervals>>, Serializable {
...

1条回答
叛逆
2楼-- · 2019-07-10 05:55

You don't need to create your CombineFn differently to use a Combine.PerKey.

You can extend either AccumulatingCombineFn (which puts the merging logic in the accumulator) or extend CombineFn (which puts the merging logic in the CombineFn). There are also other options such as BinaryCombineFn and IterableCombineFn.

Say that you have a CombineFn<InputT, AccumT, OutputT> called combineFn:

  • You can use Combine.globally(combineFn) to create a PTransform that takes a PCollection<InputT> and combines all the elements.
  • Or, you can use Combine.perKey(combineFn) to create a PTransform that takes a PCollection<KV<K, InputT>> and combines all the values associated with a each key and combines them. This corresponds to the Combine.PerKey I believe you are referring to.
查看更多
登录 后发表回答