Many tutorials mention that pre-partitioning an RDD will optimize data shuffling in Spark jobs. What confuses me is this: as I understand it, pre-partitioning also leads to a shuffle, so why does shuffling in advance benefit some later operation? Especially since Spark itself will optimize a set of transformations.
For example:
If I want to join two datasets, country (id, country) and income (id, (income, month, year)), what is the difference between these two kinds of operations? (I use PySpark; the tuples describe each dataset's schema.)
pre-partition by id:

    country = country.partitionBy(10).persist()
    income = income.partitionBy(10).persist()
    income.join(country)
directly join without pre-partitioning:

    income.join(country)
If I only need to compute this join once, is it still useful to pre-partition before the join? I think partitionBy also requires a shuffle, right? And if all of my further computation after the join uses country as the key (the previous key id used for the join will be useless and eliminated from the RDD), what should I do to optimize the calculation?
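Roughly, the follow-up computation I have in mind looks like this (a sketch; the final sum is just an example):

    # income.join(country) yields (id, ((income, month, year), country))
    joined = income.join(country)

    # Re-key by country and drop id, which is no longer needed
    by_country = joined.map(lambda kv: (kv[1][1], kv[1][0]))

    # e.g. total income per country
    totals = by_country.mapValues(lambda v: v[0]).reduceByKey(lambda a, b: a + b)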
partitionBy does shuffle the data, if that's what you are asking. By applying partitionBy preemptively you don't avoid the shuffle; you just push it to another place. This can be a good idea if the partitioned RDD is reused multiple times, but you gain nothing for a one-off join.

You're perfectly right: preemptive partitioning makes sense only if the partitioned data will be reused across multiple DAG paths. If you join only once, it just shuffles in a different place.
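To make the reuse case concrete, here is a minimal sketch (the partition count of 10 is taken from your example; the final aggregation is just illustrative): shuffle both RDDs once up front, persist them, and later co-partitioned operations can skip the shuffle:

    # Shuffle both datasets once, up front, and keep the results cached
    income = income.partitionBy(10).persist()
    country = country.partitionBy(10).persist()

    # Reuse 1: both sides now share the same partitioner, so the join
    # itself needs no additional shuffle
    joined = income.join(country, 10)

    # Reuse 2: reduceByKey over the already-partitioned income RDD;
    # with a matching partition count this is shuffle-free as well
    total_income = income.mapValues(lambda v: v[0]).reduceByKey(lambda a, b: a + b, 10)

If nothing downstream reuses the partitioned RDDs, the two partitionBy calls are exactly the shuffle the join would have performed anyway, just moved earlier in the DAG.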