AggregateByKey Partitioning?

I have:

val A_RDD = anRDD.map(...)

val B_RDD = A_RDD.aggregateByKey(...)

Alright, my question is:

If I put partitionBy(new HashPartitioner) right after the map, like this:

val A_RDD = anRDD.map(...).partitionBy(new HashPartitioner(2))

val B_RDD = A_RDD.aggregateByKey(...)
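For reference, here is a self-contained toy version of that second pipeline, runnable in spark-shell (the data, the zero value, and the two aggregation functions are just placeholders, not my real job):

    import org.apache.spark.HashPartitioner

    // Toy (key, value) input standing in for the output of my real map().
    val anRDD = sc.parallelize(Seq("a" -> 1, "b" -> 2, "a" -> 3, "c" -> 4, "b" -> 5))

    // Step 1: transform, then explicitly repartition by key.
    val A_RDD = anRDD
      .map { case (k, v) => (k, v * 10) }        // some per-record transformation
      .partitionBy(new HashPartitioner(2))       // the extra step I'm asking about

    // Step 2: aggregate per key (here: a simple sum).
    val B_RDD = A_RDD.aggregateByKey(0)(
      (acc, v) => acc + v,   // seqOp: fold a value into the partition-local accumulator
      (a, b) => a + b        // combOp: merge accumulators from different partitions
    )

    B_RDD.collect().foreach(println)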

1) Will this be as efficient as leaving it as it was in the first place? aggregateByKey() will reuse the HashPartitioner from A_RDD, right?

2) Or, if I leave it as in the first example, will aggregateByKey() aggregate each partition by key first and then send every "aggregated" (key, value) pair to the right partition more efficiently?

3) Why can't map, flatMap, and other transformations on RDDs take an argument that says how to partition the (key, value) pairs on the fly? What I mean is, for example, during a map() over every tuple, to also send that tuple to a specific partition designated by a partitioner argument on map, e.g. map(f, partitioner).
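Just to illustrate what I mean in question 3, continuing from the snippet above (the two-argument map is a hypothetical API I am imagining; it does not exist in Spark):

    // What I imagine (hypothetical -- NOT a real Spark API):
    // anRDD.map({ case (k, v) => (k, v * 10) }, new HashPartitioner(2))

    // What has to be written today: transform first, then repartition separately.
    val shaped = anRDD
      .map { case (k, v) => (k, v * 10) }
      .partitionBy(new HashPartitioner(2))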

I am trying to grasp how aggregateByKey() works, but every time I think I've got it, a new question arises... Thanks in advance.

1 Answer
劳资没心,怎么记你
  • If you put partitionBy before aggregateByKey, it will typically be less efficient than calling aggregateByKey alone: the partitionBy shuffle moves every raw record, so you effectively disable map-side combine for that shuffle.
  • If you leave it as in the first example, aggregateByKey combines values on the map side first, so only the partially aggregated (key, value) pairs are shuffled, which is typically more efficient.
  • Non-shuffling operations such as map and flatMap don't take a Partitioner because they cause no data movement; they run locally on each machine's partitions. Shuffling operations such as aggregateByKey do accept one directly, as in the sketch after this list.
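A rough sketch of the two variants (a toy word count stands in for your real aggregation; the zero value, the seqOp, and the combOp are placeholders, and sc is assumed to be a spark-shell SparkContext):

    import org.apache.spark.HashPartitioner

    val pairs = sc.parallelize(Seq("a", "b", "a", "c", "b", "a")).map(w => (w, 1))

    // Variant 1: let aggregateByKey drive the shuffle itself.
    // Each partition is combined locally first (map-side combine), and the
    // partitioner can be handed straight to aggregateByKey.
    val counts1 = pairs.aggregateByKey(0, new HashPartitioner(2))(
      (acc, v) => acc + v,   // seqOp: combine within a partition
      (a, b) => a + b        // combOp: merge partial results across partitions
    )

    // Variant 2: partitionBy first, then aggregate.
    // The partitionBy shuffle moves every raw (key, value) record with no
    // map-side combine; the subsequent aggregateByKey then runs without a
    // further shuffle because the partitioner is already set.
    val counts2 = pairs
      .partitionBy(new HashPartitioner(2))
      .aggregateByKey(0)((acc, v) => acc + v, (a, b) => a + b)

    counts1.collect().foreach(println)
    counts2.collect().foreach(println)

Both produce the same result; the difference is how much data has to cross the network, and whether the shuffle happens inside aggregateByKey (with map-side combine) or inside partitionBy (without it).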