Spark Dataset aggregation similar to RDD aggregate

Posted 2020-02-15 02:23

RDD has a very useful method, aggregate, that accumulates from a zero value within each partition and then combines the results across partitions. Is there any way to do that with Dataset[T]? As far as I can tell from the Scaladoc, nothing in the API does this. Even the reduce method only accepts a binary operation with T as both arguments. Is there a reason for this, and is there anything that can do the same?

Thanks a lot!

Tags: scala apache-spark apache-spark-sql rdd apache-spark-dataset

1 answer

家丑人穷心不美

#2 · 2020-02-15 03:17

There are two classes that can be used to achieve aggregate-like behavior in the Dataset API:

UserDefinedAggregateFunction, which uses SQL types and takes Columns as input.

The initial value is defined with the initialize method, seqOp with the update method, and combOp with the merge method.

Example implementation: How to define a custom aggregation function to sum a column of Vectors?

Aggregator, which uses standard Scala types with Encoders and takes records as input.

The initial value is defined with the zero method, seqOp with the reduce method, and combOp with the merge method.

Example implementation: How to find mean of grouped Vector columns in Spark SQL?
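To make the mapping from RDD.aggregate concrete, here is a minimal Aggregator sketch. The names SumAndCount and AggregatorSketch, the sample data, and the local SparkSession are assumptions made for illustration and are not taken from the linked examples. It computes a (sum, count) pair over a Dataset[Int], roughly what rdd.aggregate((0L, 0L))(seqOp, combOp) would return.

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator

// Hypothetical aggregator: input Int, buffer (sum, count), output (sum, count).
object SumAndCount extends Aggregator[Int, (Long, Long), (Long, Long)] {
  // Plays the role of the zero value passed to RDD.aggregate.
  def zero: (Long, Long) = (0L, 0L)

  // Plays the role of seqOp: fold one record into the buffer.
  def reduce(acc: (Long, Long), x: Int): (Long, Long) =
    (acc._1 + x, acc._2 + 1L)

  // Plays the role of combOp: combine two partial buffers.
  def merge(a: (Long, Long), b: (Long, Long)): (Long, Long) =
    (a._1 + b._1, a._2 + b._2)

  // Finalization step; here the buffer is returned unchanged.
  def finish(acc: (Long, Long)): (Long, Long) = acc

  def bufferEncoder: Encoder[(Long, Long)] =
    Encoders.tuple(Encoders.scalaLong, Encoders.scalaLong)

  def outputEncoder: Encoder[(Long, Long)] =
    Encoders.tuple(Encoders.scalaLong, Encoders.scalaLong)
}

object AggregatorSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session is good enough for a demonstration.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("aggregator-sketch")
      .getOrCreate()
    import spark.implicits._

    val ds = Seq(1, 2, 3, 4).toDS()

    // toColumn turns the Aggregator into a TypedColumn usable in select.
    val (sum, count) = ds.select(SumAndCount.toColumn).first()
    println(s"sum=$sum count=$count")

    spark.stop()
  }
}
```

Unlike RDD.aggregate, the buffer and output types need Encoders, which is why bufferEncoder and outputEncoder have to be supplied.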

Both provide an additional finalization method (evaluate and finish, respectively) that generates the final result, and both can be used for global and by-key aggregations.
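As an illustration of the finalization step and of global versus by-key use, the following is a rough UserDefinedAggregateFunction sketch that computes a mean. The names MeanUdaf and UdafSketch, the column names, and the sample rows are hypothetical. Note that in Spark 3.0 and later this class is marked deprecated in favor of an Aggregator registered through functions.udaf, so treat this as a sketch for the API as described above.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

// Hypothetical UDAF: mean of a Double column, buffer = (sum, count).
object MeanUdaf extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("value", DoubleType) :: Nil)
  def bufferSchema: StructType =
    StructType(StructField("sum", DoubleType) :: StructField("count", LongType) :: Nil)
  def dataType: DataType = DoubleType
  def deterministic: Boolean = true

  // Zero value.
  def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer.update(0, 0.0)
    buffer.update(1, 0L)
  }

  // seqOp: fold one input row into the buffer, skipping nulls.
  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0)) {
      buffer.update(0, buffer.getDouble(0) + input.getDouble(0))
      buffer.update(1, buffer.getLong(1) + 1L)
    }
  }

  // combOp: combine two partial buffers.
  def merge(b1: MutableAggregationBuffer, b2: Row): Unit = {
    b1.update(0, b1.getDouble(0) + b2.getDouble(0))
    b1.update(1, b1.getLong(1) + b2.getLong(1))
  }

  // Finalization: turn (sum, count) into the mean, or null for empty input.
  def evaluate(buffer: Row): Any =
    if (buffer.getLong(1) == 0L) null else buffer.getDouble(0) / buffer.getLong(1)
}

object UdafSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("udaf-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("a", 1.0), ("a", 3.0), ("b", 10.0)).toDF("key", "value")

    // Global aggregation over the whole DataFrame.
    df.agg(MeanUdaf(col("value")).as("mean")).show()

    // By-key aggregation with the same function.
    df.groupBy("key").agg(MeanUdaf(col("value")).as("mean")).show()

    spark.stop()
  }
}
```

The same object serves both calls; only the grouping changes, which matches the point that one definition covers global and by-key aggregation.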
