When create two different Spark Pair RDD with same

2019-07-01 19:04发布

I want to do a join operation between two very big key-value pair RDDs. The keys of these two RDD comes from the same set. To reduce data shuffle, I wish I could add a pre-distribute phase so that partitions with the same key will be distributed on the same machine. Hopefully this could reduce some shuffle time.

I want to know is spark smart enough to do that for me or I have to implement this logic myself?

I know when I join two RDD, one preprocess with partitionBy. Spark is smart enough to use this information and only shuffle the other RDD. But I don't know what will happen if I use partitionBy on two RDD at the same time and then do the join.

标签： scala join apache-spark rdd

2条回答

趁早两清

2楼-- · 2019-07-01 19:24

I have seen this, Speeding Up Joins by Assigning a Known Partitioner that would be helpful to understand the effect of using the same partitioner for both RDDs;

Speeding Up Joins by Assigning a Known Partitioner
If you have to do an operation before the join that requires a shuffle, such as aggregateByKey or reduceByKey, you can prevent the shuffle by adding a hash partitioner with the same number of partitions as an explicit argument to the first operation and persisting the RDD before the join.

0人赞添加讨论(0) 举报

混吃等死

3楼-- · 2019-07-01 19:45

If you use the same partitioner for both RDDs you achieve co-partitioning of your data sets. That does not necessarily mean that your RDDs are co-located - that is, that the partitioned data is located on the same node.

Nevertheless, the performance should be better as if both RDDs would have different partitioner.

0人赞添加讨论(0) 举报

When create two different Spark Pair RDD with same

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间