What is the difference between `reduce` and `fold` with respect to their technical implementation?
I understand that they differ in their signatures: `fold` accepts an additional parameter (the initial value), which gets added to each partition's output.
- Can someone describe a use case for each of these two actions?
- Which would perform better, and in which scenario, considering that 0 is used as the initial value for `fold`?
Thanks in advance.
There is no practical difference when it comes to performance whatsoever:
- The `RDD.fold` action uses `fold` on the partition `Iterator`s, which is implemented using `foldLeft`.
- `RDD.reduce` uses `reduceLeft` on the partition `Iterator`s.

Both methods keep a mutable accumulator and process partitions sequentially using simple loops; this is how `foldLeft` and `reduceLeft` themselves are implemented.

The practical difference between these methods in Spark is only related to their behavior on empty collections and the ability to use a mutable buffer (arguably, the latter is related to performance). You'll find some discussion in "Why is the fold action necessary in Spark?".
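
The loops described above can be sketched roughly as follows. This is a simplified illustration in plain Scala, not the verbatim Scala standard library or Spark source; `FoldReduceSketch`, `foldLeftSketch` and `reduceLeftSketch` are hypothetical names introduced only for this example. It also shows the two practical differences mentioned: behavior on empty input and the option of passing a mutable buffer as the initial value.

```scala
import scala.collection.mutable.ArrayBuffer

// Simplified sketches of the two loops (hypothetical helpers, not library source).
object FoldReduceSketch {
  // fold-style loop: the accumulator starts from the supplied zero value,
  // so an empty iterator simply returns that value.
  def foldLeftSketch[A, B](it: Iterator[A])(z: B)(op: (B, A) => B): B = {
    var result = z
    while (it.hasNext) result = op(result, it.next())
    result
  }

  // reduce-style loop: the accumulator starts from the first element,
  // so an empty iterator has no result and is rejected.
  def reduceLeftSketch[A](it: Iterator[A])(op: (A, A) => A): A = {
    if (!it.hasNext) throw new UnsupportedOperationException("empty.reduceLeft")
    var acc = it.next()
    while (it.hasNext) acc = op(acc, it.next())
    acc
  }

  def main(args: Array[String]): Unit = {
    println(foldLeftSketch(Iterator(1, 2, 3))(0)(_ + _))      // 6
    println(reduceLeftSketch(Iterator(1, 2, 3))(_ + _))       // 6
    println(foldLeftSketch(List.empty[Int].iterator)(0)(_ + _)) // 0

    // A mutable buffer as the initial value: elements are appended in place
    // instead of allocating a new collection at every step.
    val collected =
      foldLeftSketch(Iterator(1, 2, 3))(ArrayBuffer.empty[Int])((buf, x) => buf += x)
    println(collected.toList) // List(1, 2, 3)

    // reduceLeftSketch(List.empty[Int].iterator)(_ + _) would throw
    // UnsupportedOperationException, since there is no first element.
  }
}
```

With 0 as the initial value and addition as the operator, both loops do the same amount of work per element, which is consistent with the claim that there is no practical performance difference.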
Moreover, there is no difference in the overall processing model: