RDD has a very useful method, aggregate, that lets you accumulate starting from a zero value and then combine the partial results across partitions. Is there any way to do that with Dataset[T]? As far as I can see from the Scaladoc, there is nothing capable of doing that. Even the reduce method only supports binary operations with T as both arguments. Is there a reason for this? And is there anything that can do the same?
Thanks a lot!
VK
There are two different classes that can be used to achieve aggregate-like behavior in the Dataset API:
UserDefinedAggregateFunction, which uses SQL types and takes Columns as input.
The initial value is defined with the initialize method, seqOp with the update method, and combOp with the merge method.
Example implementation: How to define a custom aggregation function to sum a column of Vectors?
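To make the mapping concrete, here is a minimal sketch of a UserDefinedAggregateFunction that sums a LongType column. The class name LongSum, the column name value, and the local SparkSession are made up for illustration and are not taken from the linked answer. Note that UserDefinedAggregateFunction has been deprecated since Spark 3.0, so this assumes a Spark version where the class is still available.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DataType, LongType, StructField, StructType}

// Hypothetical UDAF: sums a single LongType column.
class LongSum extends UserDefinedAggregateFunction {
  // SQL types of the input column(s), the intermediate buffer, and the result.
  def inputSchema: StructType = StructType(StructField("value", LongType) :: Nil)
  def bufferSchema: StructType = StructType(StructField("sum", LongType) :: Nil)
  def dataType: DataType = LongType
  def deterministic: Boolean = true

  // Zero value of the aggregation.
  def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 0L
  }

  // seqOp: fold one input row into the buffer (nulls are skipped).
  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0)) buffer(0) = buffer.getLong(0) + input.getLong(0)
  }

  // combOp: combine two partial buffers.
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
  }

  // Finalization: turn the buffer into the result.
  def evaluate(buffer: Row): Any = buffer.getLong(0)
}

object UdafExample {
  def main(args: Array[String]): Unit = {
    // Local session purely for demonstration.
    val spark = SparkSession.builder().master("local[*]").appName("udaf-sketch").getOrCreate()
    import spark.implicits._

    val df = Seq(1L, 2L, 3L).toDF("value")
    val longSum = new LongSum
    val total = df.agg(longSum(col("value"))).head().getLong(0)
    println(total) // prints 6

    spark.stop()
  }
}
```

Because the function is applied to Columns, it works on an untyped DataFrame and can also be used after groupBy.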
Aggregator, which uses standard Scala types with Encoders and takes records as input.
The initial value is defined with the zero method, seqOp with the reduce method, and combOp with the merge method.
Example implementation: How to find mean of grouped Vector columns in Spark SQL?
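As an illustration, the following sketch shows an Aggregator that mirrors rdd.aggregate(0L)((acc, s) => acc + s.length, _ + _) on a Dataset[String]. The object name TotalLength and the sample data are hypothetical and chosen only to keep the example small.

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator

// Hypothetical aggregator: total length of all strings in a Dataset[String].
// Type parameters are input type, buffer type, and output type.
object TotalLength extends Aggregator[String, Long, Long] {
  def zero: Long = 0L                                       // zero value
  def reduce(acc: Long, s: String): Long = acc + s.length   // seqOp
  def merge(a: Long, b: Long): Long = a + b                 // combOp
  def finish(acc: Long): Long = acc                         // finalization
  def bufferEncoder: Encoder[Long] = Encoders.scalaLong
  def outputEncoder: Encoder[Long] = Encoders.scalaLong
}

object AggregatorExample {
  def main(args: Array[String]): Unit = {
    // Local session purely for demonstration.
    val spark = SparkSession.builder().master("local[*]").appName("aggregator-sketch").getOrCreate()
    import spark.implicits._

    val ds = Seq("a", "bb", "ccc").toDS()
    // toColumn turns the aggregator into a TypedColumn usable in select.
    val total: Long = ds.select(TotalLength.toColumn).head()
    println(total) // prints 6

    spark.stop()
  }
}
```

Unlike reduce on Dataset[T], the buffer and output types here do not have to match the element type, which is the flexibility the question asks about.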
Both provide an additional finalization method (evaluate and finish, respectively) that generates the final result, and both can be used for global as well as by-key aggregations.
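A short sketch of the last point, using the typed API: the same Aggregator instance is applied once globally and once per key, and finish does real work by dividing the accumulated sum by the count. The Sale case class, the MeanAmount object, and the sample data are assumptions made up for this example.

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator

// Hypothetical record type, not from the original answer.
case class Sale(shop: String, amount: Double)

// Hypothetical aggregator: mean amount, with a (sum, count) buffer.
object MeanAmount extends Aggregator[Sale, (Double, Long), Double] {
  def zero: (Double, Long) = (0.0, 0L)
  def reduce(acc: (Double, Long), s: Sale): (Double, Long) =
    (acc._1 + s.amount, acc._2 + 1)
  def merge(a: (Double, Long), b: (Double, Long)): (Double, Long) =
    (a._1 + b._1, a._2 + b._2)
  // Finalization step: convert (sum, count) into the mean.
  def finish(acc: (Double, Long)): Double =
    if (acc._2 == 0) Double.NaN else acc._1 / acc._2
  def bufferEncoder: Encoder[(Double, Long)] =
    Encoders.tuple(Encoders.scalaDouble, Encoders.scalaLong)
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

object GlobalAndByKeyExample {
  def main(args: Array[String]): Unit = {
    // Local session purely for demonstration.
    val spark = SparkSession.builder().master("local[*]").appName("by-key-sketch").getOrCreate()
    import spark.implicits._

    val sales = Seq(Sale("a", 1.0), Sale("a", 3.0), Sale("b", 10.0)).toDS()

    // Global aggregation over the whole Dataset.
    val overall: Double = sales.select(MeanAmount.toColumn).head()

    // By-key aggregation: one result per shop.
    val byShop: Map[String, Double] =
      sales.groupByKey(_.shop).agg(MeanAmount.toColumn).collect().toMap

    println(overall)
    println(byShop)

    spark.stop()
  }
}
```

With this sample data, shop a averages to 2.0 and shop b to 10.0, while the global mean is 14.0 / 3.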