reduce() vs. fold() in Apache Spark

Posted 2019-05-08 04:34

What is the difference between reduce and fold with respect to their technical implementation?

I understand that they differ in their signatures: fold accepts an additional parameter (an initial value), which gets combined into each partition's output.

Can someone describe the use cases for these two actions?
Which would perform better in which scenario, considering that 0 is used as the initial value for fold?

Thanks in advance.

Tags: scala apache-spark rdd reduce fold

1 answer

The star

#2 · 2019-05-08 05:31

There is no practical difference in performance:

The RDD.fold action calls fold on each partition's Iterator, which is implemented using foldLeft.
The RDD.reduce action calls reduceLeft on each partition's Iterator.

Both methods keep a mutable accumulator and process the elements of a partition sequentially using a simple loop. foldLeft is implemented like this:

foreach (x => result = op(result, x))

and reduceLeft like this:

for (x <- self) {
  if (first) {
    ...
  }
  else acc = op(acc, x)
}

The practical difference between these methods in Spark relates only to their behavior on empty collections and to fold's ability to use a mutable buffer as the accumulator (which is arguably a performance matter). You'll find some discussion in "Why is the fold action necessary in Spark?"
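The two differences mentioned above can be sketched with plain Scala iterators standing in for the contents of a partition. This is only an illustration under that assumption: no Spark API is called, and the names FoldVsReduceSketch, noElements, foldOnEmpty, reduceOnEmpty and merged are made up for the example.

```scala
import scala.collection.mutable.ArrayBuffer
import scala.util.Try

object FoldVsReduceSketch {
  // Hypothetical stand-in for a collection with no elements (plain Scala, not Spark).
  def noElements: Iterator[Int] = Iterator.empty

  // fold has a zero value to fall back on, so empty input simply yields that value.
  val foldOnEmpty: Int = noElements.fold(0)(_ + _)

  // reduceLeft has nothing to start from, so empty input is an error
  // (an UnsupportedOperationException in the Scala standard library).
  val reduceOnEmpty: Try[Int] = Try(noElements.reduceLeft(_ + _))

  // With fold, the zero value can be a fresh mutable buffer that is updated in place,
  // so no input element has to be modified and no new buffer is allocated per step.
  val merged: ArrayBuffer[Int] =
    Iterator(ArrayBuffer(1, 2), ArrayBuffer(3))
      .fold(ArrayBuffer.empty[Int])((acc, x) => acc ++= x)

  def main(args: Array[String]): Unit = {
    println(foldOnEmpty)
    println(reduceOnEmpty.isFailure)
    println(merged)
  }
}
```

Doing the same in-place merge with reduceLeft would mutate the first input element, because that element becomes the accumulator.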

Moreover, there is no difference in the overall processing model:

Each partition is processed sequentially using a single thread.
Partitions are processed in parallel using multiple executors / executor threads.
Final merge is performed sequentially using a single thread on the driver.
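The three steps above can be imitated on local collections. This is a rough sketch, not Spark's actual code: the nested Seq is a hypothetical stand-in for an RDD with three partitions, everything runs in one thread, and foldLike and reduceLike are invented names. It also shows why the question's choice of 0 matters: the zero value is applied once per partition and once more in the final merge.

```scala
object ProcessingModelSketch {
  // Hypothetical stand-in for an RDD split into three partitions.
  val partitions: Seq[Seq[Int]] = Seq(Seq(1, 2), Seq(3, 4), Seq(5))

  def foldLike(zero: Int)(op: (Int, Int) => Int): Int = {
    // Each partition is folded on its own, starting from the zero value
    // (in Spark these would be separate tasks running in parallel).
    val perPartition = partitions.map(_.iterator.fold(zero)(op))
    // The partial results are merged sequentially, again starting from zero.
    perPartition.foldLeft(zero)(op)
  }

  def reduceLike(op: (Int, Int) => Int): Int = {
    // An empty partition contributes no partial result.
    val perPartition = partitions.flatMap(p => p.reduceLeftOption(op))
    // The merge has no zero value, so it fails if there are no partial results at all.
    perPartition.reduceLeft(op)
  }

  def main(args: Array[String]): Unit = {
    println(foldLike(0)(_ + _))  // same result as reduceLike for a neutral zero
    println(reduceLike(_ + _))
    println(foldLike(1)(_ + _))  // a non-neutral zero is counted 3 + 1 times
  }
}
```

With a neutral zero value such as 0 for addition, both functions do the same amount of work and return the same result, which matches the claim that there is no performance difference.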
