You can see the implementation here: https://github.com/apache/spark/blob/ffa05c84fe75663fc33f3d954d1cb1e084ab3280/python/pyspark/rdd.py#L804
How does it different from the 'normal' reduce
function?
What does it mean depth = 2
?
I don't want that the reducer function will pass linearly on the partitions, but reduce each available pairs first, and then will iterate like that until i have only one pair and reduce it to 1, as shown in the picture:
Does treeReduce
achieve that?
Standard
reduce
is taking a wrapped version of the function and using it tomapPartitions
. After that results are collected and reduced locally on a driver. If number of the partitions is large and/or function you use is expensive it places a significant load on a single machine.The first phase of the
treeReduce
is pretty much the same as above but after that partial results are merged in parallel and only the final aggregation is performed on the driver.depth
is suggested depth of the tree and since depth of the node in tree is defined as number of edges between the root and the node it should you give you more or less an expected pattern although it looks like a distributed aggregation can be stopped early in some cases.It is worth to note that what you get with
treeReduce
is not a binary tree. Number of the partitions is adjusted on each level and most likely more than a two partitions will be merged at once.Compared to the standard reduce, tree based version performs
reduceByKey
with each iteration and it means a lot of data shuffling. If number of the partitions is relatively small it will be much cheaper to use plainreduce
. If you suspect that the final phase of thereduce
is a bottlenecktree*
version could be worth trying.