AggregateByKey Partitioning?

Posted 2019-08-03 23:39

Question:

I have:

A_RDD = anRDD.map()

B_RDD = A_RDD.aggregateByKey()

Alright, my question is:

If I put partitionBy(new HashPartitioner) after A_RDD, like:

A_RDD = anRDD.map().partitionBy(new HashPartitioner(2))

B_RDD = A_RDD.aggregateByKey()

1) Will this be as efficient as leaving it as it was in the first place? aggregateByKey() will use the HashPartitioner from A_RDD, right?

2) Or, if I leave it as in the first example, will aggregateByKey() aggregate every partition by key first, and then send every "aggregated" (key, value) pair to the right partition in a more efficient way?

3) Why can't map, flatMap and other transformations on RDDs take an argument specifying how to partition the (key, value) pairs on the fly? What I mean is that, for example, during the map() operation each tuple would also be sent to a specific partition designated by a partitioner argument on map, e.g. map( , Partitioner).

I am trying to grasp how aggregateByKey() works, but every time I think I've got it, a new question arises... Thanks in advance.

Answer 1:

1) If you put partitionBy before aggregateByKey, it will typically be less efficient than aggregateByKey alone. partitionBy shuffles every raw record, so you effectively disable the map-side combine.
2) If you leave it as it is, there will be a map-side combine, which is typically more efficient.
3) Non-shuffling operations don't take a partitioner because there is no data movement. The operations are performed locally on each machine.
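
The two variants from the question can be compared with a small sketch like the one below. It is only an illustration under assumptions: the object and method names, the sample data, the sum aggregation and the local[2] master are all made up here, since the question does not show what map() and aggregateByKey() actually compute.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical names throughout; the aggregation (a per-key sum) is an assumption.
object AggregateByKeyPartitioning {

  // Variant 1: aggregateByKey alone. Values are combined per key inside each
  // input partition first, and only the partial sums go through the shuffle.
  def direct(pairs: RDD[(String, Int)]): RDD[(String, Int)] =
    pairs.aggregateByKey(0)(_ + _, _ + _)

  // Variant 2: partitionBy first. Every raw (key, value) record is shuffled
  // before any combining happens; aggregateByKey then runs on data that is
  // already grouped by key hash.
  def viaPartitionBy(pairs: RDD[(String, Int)]): RDD[(String, Int)] =
    pairs.partitionBy(new HashPartitioner(2)).aggregateByKey(0)(_ + _, _ + _)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("aggregateByKey-partitioning").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Made-up sample data standing in for anRDD.map(...)
      val pairs = sc.parallelize(
        Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)), numSlices = 2)

      // Both variants produce the same sums; they differ in what gets shuffled.
      println(direct(pairs).collect().toMap)
      println(viaPartitionBy(pairs).collect().toMap)

      // The lineage shows where the shuffle sits in each variant.
      println(direct(pairs).toDebugString)
      println(viaPartitionBy(pairs).toDebugString)
    } finally {
      sc.stop()
    }
  }
}
```

Both variants return the same result, so the difference is purely in cost. In the second one the shuffle belongs to partitionBy and moves un-aggregated records, which is why it is usually the slower choice unless the partitioned RDD is reused by several later operations.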
