RDD
has a very useful method aggregate that allows to accumulate with some zero value and combine that across partitions. Is there any way to do that with Dataset[T]
. As far as I see the specification via Scala doc, there is actually nothing capable of doing that. Even the reduce method allows to do things only for binary operations with T as both arguments. Any reason why? And if there is anything capable of doing the same?
Thanks a lot!
VK
There are two different classes which can be used to achieve
aggregate
-like behavior inDataset
API:UserDefinedAggregateFunction
which usesSQL
types and takesColumns
as an input.Initial value is defined using
initialize
method,seqOp
withupdate
method andcombOp
withmerge
method.Example implementation: How to define a custom aggregation function to sum a column of Vectors?
Aggregator
which uses standard Scala types withEncoders
and takes records as an input.Initial value is defined using
zero
method,seqOp
withreduce
method andcombOp
withmerge
method.Example implementation: How to find mean of grouped Vector columns in Spark SQL?
Both provide additional finalization method (
evaluate
andfinish
respectively) which is used to generate final results and can be used for both global and by-key aggregations.