AggregateByKey Partitioning?

I have:

val A_RDD = anRDD.map(...)

val B_RDD = A_RDD.aggregateByKey(...)

Alright, my question is:

If I put partitionBy(new HashPartitioner) right after the map, like this:

val A_RDD = anRDD.map(...).partitionBy(new HashPartitioner(2))

val B_RDD = A_RDD.aggregateByKey(...)
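For reference, here is a self-contained toy version of that second pipeline, runnable in spark-shell (the data, the zero value, and the two aggregation functions are just placeholders, not my real job):

    import org.apache.spark.HashPartitioner

    // Toy (key, value) input standing in for the output of my real map().
    val anRDD = sc.parallelize(Seq("a" -> 1, "b" -> 2, "a" -> 3, "c" -> 4, "b" -> 5))

    // Step 1: transform, then explicitly repartition by key.
    val A_RDD = anRDD
      .map { case (k, v) => (k, v * 10) }        // some per-record transformation
      .partitionBy(new HashPartitioner(2))       // the extra step I'm asking about

    // Step 2: aggregate per key (here: a simple sum).
    val B_RDD = A_RDD.aggregateByKey(0)(
      (acc, v) => acc + v,   // seqOp: fold a value into the partition-local accumulator
      (a, b) => a + b        // combOp: merge accumulators from different partitions
    )

    B_RDD.collect().foreach(println)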

1) Will this be as efficient as leaving it as it was in the first place? aggregateByKey() will reuse the HashPartitioner from A_RDD, right?

2) Or, if I leave it as in the first example, will aggregateByKey() aggregate each partition by key first and then send every "aggregated" (key, value) pair to the right partition more efficiently?

3) Why can't map, flatMap, and other transformations on RDDs take an argument that says how to partition the (key, value) pairs on the fly? What I mean is, for example, during a map() over every tuple, to also send that tuple to a specific partition designated by a partitioner argument on map, e.g. map(f, partitioner).
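Just to illustrate what I mean in question 3, continuing from the snippet above (the two-argument map is a hypothetical API I am imagining; it does not exist in Spark):

    // What I imagine (hypothetical -- NOT a real Spark API):
    // anRDD.map({ case (k, v) => (k, v * 10) }, new HashPartitioner(2))

    // What has to be written today: transform first, then repartition separately.
    val shaped = anRDD
      .map { case (k, v) => (k, v * 10) }
      .partitionBy(new HashPartitioner(2))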

I am trying to grasp how aggregateByKey() works, but every time I think I've got it, a new question arises... Thanks in advance.

1 Answer
劳资没心,怎么记你
  • If you put partitionBy before aggregateByKey, it will typically be less efficient than calling aggregateByKey alone: the partitionBy shuffle moves every raw record, so you effectively disable map-side combine for that shuffle.
  • If you leave it as in the first example, aggregateByKey combines values on the map side first, so only the partially aggregated (key, value) pairs are shuffled, which is typically more efficient.
  • Non-shuffling operations such as map and flatMap don't take a Partitioner because they cause no data movement; they run locally on each machine's partitions. Shuffling operations such as aggregateByKey do accept one directly, as in the sketch after this list.
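A rough sketch of the two variants (a toy word count stands in for your real aggregation; the zero value, the seqOp, and the combOp are placeholders, and sc is assumed to be a spark-shell SparkContext):

    import org.apache.spark.HashPartitioner

    val pairs = sc.parallelize(Seq("a", "b", "a", "c", "b", "a")).map(w => (w, 1))

    // Variant 1: let aggregateByKey drive the shuffle itself.
    // Each partition is combined locally first (map-side combine), and the
    // partitioner can be handed straight to aggregateByKey.
    val counts1 = pairs.aggregateByKey(0, new HashPartitioner(2))(
      (acc, v) => acc + v,   // seqOp: combine within a partition
      (a, b) => a + b        // combOp: merge partial results across partitions
    )

    // Variant 2: partitionBy first, then aggregate.
    // The partitionBy shuffle moves every raw (key, value) record with no
    // map-side combine; the subsequent aggregateByKey then runs without a
    // further shuffle because the partitioner is already set.
    val counts2 = pairs
      .partitionBy(new HashPartitioner(2))
      .aggregateByKey(0)((acc, v) => acc + v, (a, b) => a + b)

    counts1.collect().foreach(println)
    counts2.collect().foreach(println)

Both produce the same result; the difference is how much data has to cross the network, and whether the shuffle happens inside aggregateByKey (with map-side combine) or inside partitionBy (without it).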