I have:
A_RDD = anRDD.map()
B_RDD = A_RDD.aggregateByKey()
My question is:
If I put partitionBy(new HashPartitioner) after the map that produces A_RDD, like this:
A_RDD = anRDD.map().partitionBy(new HashPartitioner(2))
B_RDD = A_RDD.aggregateByKey()
1) Will this be as efficient as leaving it as it was in the first place? aggregateByKey() will use the HashPartitioner from A_RDD, right?
2) Or, if I leave it as in the first example, will aggregateByKey() aggregate within every partition by key first, and then send each "aggregated" (key, value) pair to the right partition in a more efficient way?
3) Why can't map, flatMap, and other transformations on RDDs take an argument specifying how to partition the (key, value) pairs on the fly? What I mean is that, during the map() operation, each tuple could also be sent to a specific partition designated by a partitioner argument, e.g. map( , Partitioner).
I am trying to grasp how aggregateByKey() works, but every time I think I've got it, a new question arises. Thanks in advance.
If you use partitionBy before aggregateByKey, it will typically be less efficient than aggregateByKey alone. You effectively disable map-side combine.
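To make the difference concrete, here is a minimal sketch of both variants side by side. The sample data, the object name AggregateByKeySketch, the local[2] master, and the sum aggregation are assumptions made for illustration; they are not from the question. It assumes Spark is on the classpath.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object AggregateByKeySketch {
  def main(args: Array[String]): Unit = {
    // Local mode is an assumption so the sketch can run without a cluster.
    val conf = new SparkConf().setAppName("aggregateByKey-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Hypothetical data standing in for anRDD.map(...), spread over 4 partitions.
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)), 4)

    // Variant 1: aggregateByKey alone.
    // Values are combined per key inside each input partition first (map-side combine),
    // so only the partial sums are shuffled.
    val direct = pairs.aggregateByKey(0)(_ + _, _ + _)

    // Variant 2: partitionBy first.
    // partitionBy shuffles every raw (key, value) record. aggregateByKey then finds
    // the data already partitioned and aggregates locally, with no second shuffle.
    val prePartitioned = pairs.partitionBy(new HashPartitioner(2))
    val afterPartition = prePartitioned.aggregateByKey(0)(_ + _, _ + _)

    // Both variants compute the same per-key sums.
    println(direct.collect().sorted.mkString(", "))
    println(afterPartition.collect().sorted.mkString(", "))

    // Shows whether aggregateByKey kept the partitioner set by partitionBy.
    println(afterPartition.partitioner == prePartitioned.partitioner)

    // The lineage shows where each variant introduces its shuffle.
    println(direct.toDebugString)
    println(afterPartition.toDebugString)

    sc.stop()
  }
}
```

So, regarding the first question: aggregateByKey does pick up the partitioner left by partitionBy and avoids a second shuffle, but by then every uncombined record has already been sent across the network. With aggregateByKey alone, there is still a single shuffle, and it moves at most one partial aggregate per key per input partition.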