Many tutorials mention that pre-partitioning an RDD will optimize the data shuffling of Spark jobs. What confuses me is that, as far as I understand, pre-partitioning also leads to a shuffle, so why does shuffling in advance benefit some later operation? Especially since Spark itself will optimize a given set of transformations.
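For instance, the shuffle seems to show up directly in the lineage (a minimal sketch; the local SparkContext and toy data are just my assumptions for illustration):

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "partition-check")  # hypothetical local setup
pairs = sc.parallelize([(i, i * 100) for i in range(1000)])

# The lineage of the result should contain a ShuffledRDD,
# i.e. partitionBy itself already performs a shuffle.
print(pairs.partitionBy(10).toDebugString())
```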
For example:
If I want to join two datasets, country (id, country) and income (id, (income, month, year)), what is the difference between these two kinds of operation? (I am using the PySpark API.)
Pre-partition by id:

```python
country = country.partitionBy(10).persist()
income = income.partitionBy(10).persist()
income.join(country)
```
Join directly without pre-partitioning:

```python
income.join(country)
```
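For concreteness, here is a minimal end-to-end sketch of both variants (the SparkContext setup and the sample records are made up, purely for illustration):

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "prepartition-join")  # hypothetical local setup

# Toy data shaped like my real RDDs: (id, country) and (id, (income, month, year))
country = sc.parallelize([(1, "US"), (2, "DE"), (3, "JP")])
income = sc.parallelize([(1, (5000, 1, 2016)),
                         (2, (4200, 1, 2016)),
                         (3, (3800, 1, 2016))])

# Variant 1: pre-partition both sides by key, then join.
# partitionBy shuffles here, but the join can then reuse the co-partitioning.
country_p = country.partitionBy(10).persist()
income_p = income.partitionBy(10).persist()
joined = income_p.join(country_p)

# Variant 2: join directly; the shuffle happens inside the join instead.
joined_direct = income.join(country)

print(joined.collect())
print(joined_direct.collect())
```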
If I only need to compute this join once, is it still useful to pre-partition before the join? I think partitionBy also requires a shuffle, right? And if all my further computation after the join uses country as the key (the id key used for the join will no longer be needed and will be dropped from the RDD), what should I do to optimize the calculation?
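To make the last part concrete, continuing from the `joined` RDD in the sketch above, this is roughly the re-keying I mean (the aggregation at the end is a made-up example of the downstream work):

```python
# joined holds records of the form (id, ((income, month, year), country));
# drop id and re-key by country.
by_country = joined.map(lambda kv: (kv[1][1], kv[1][0]))

# map() discards the partitioner, so my guess is that I should partition
# again by the new key before the country-keyed operations, hence the question:
by_country = by_country.partitionBy(10).persist()

# Illustrative downstream computation: total income per country.
totals = by_country.mapValues(lambda rec: rec[0]).reduceByKey(lambda a, b: a + b)
print(totals.collect())
```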