When using the Scala standard lib, I can do somthing like this:
scala> val scalaList = List(1,2,3)
scalaList: List[Int] = List(1, 2, 3)
scala> scalaList.foldLeft(0)((acc,n)=>acc+n)
res0: Int = 6
Making one Int out of many Ints.
And I can do something like this:
scala> scalaList.foldLeft("")((acc,n)=>acc+n.toString)
res1: String = 123
Making one String out of many Ints.
So, foldLeft could be either homogeneous or heterogeneous, whichever we want, it's in one API.
While in Spark, if I want one Int out of many Ints, I can do this:
scala> val rdd = sc.parallelize(List(1,2,3))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:12
scala> rdd.fold(0)((acc,n)=>acc+n)
res1: Int = 6
The fold API is similar to foldLeft, but it is only homogeneous, a RDD[Int] can only produce Int with fold.
There is a aggregate API in spark too:
scala> rdd.aggregate("")((acc,n)=>acc+n.toString, (s1,s2)=>s1+s2)
res11: String = 132
It is heterogeneous, a RDD[Int] can produce a String now.
So, why are fold and aggregate implemented as two different APIs in Spark?
Why are they not designed like foldLeft that could be both homogeneous and heterogeneous?
(I am very new to Spark, please excuse me if this is a silly question.)
fold
can be implemented more efficiently because it doesn't depend on a fixed order of evaluation. So each cluster node canfold
its own chunk in parallel, and then one small overallfold
at the end. Whereas withfoldLeft
each element has to be folded in in order and nothing can be done in parallel.(Also it's nice to have a simpler API for the common case for convenience. The standard lib has
reduce
as well asfoldLeft
for this reason)Specifically in Spark, the computation is distributed and done in parallel, so
foldLeft
can't be implemented as it is in the standard library. Instead, the aggregate requires two functions, one that performs an operation similar tofold
on each element of typeT
, producing a value of typeU
, and another that combines theU
from each partition into the final value:foldLeft, foldRight, reduceLeft, reduceRight, scanLeft
andscanRight
are operations where the accumulated parameter can be different from the input parameters ((A, B) -> B
) and those operations can only be executed sequentially.fold
is an operation where the accumulated parameter has to be the same type of the input parameters ((A, A) -> A
). Then it can be executed in parallel.aggregation
is an operation where the accumulated parameter can be of different type as the input parameters, but then you have to provide an additional function that defines how the accumulated parameters can be aggregated in the final result. This operation allows parallel execution. Theaggregation
operation is a combination offoldLeft
andfold
.For more detailed information, you can have a look at the coursera videos for the "Parallel programming" course: