Why does the groupByKey operation always produce 200 tasks?

Posted 2019-06-26 11:12

Whenever I do a groupByKey on an RDD, the shuffle gets split into 200 tasks, even when the original data is quite large, e.g. 2k partitions and tens of millions of rows.

Moreover, the operation seems to get stuck on the last two tasks, which take extremely long to compute.

Why is it 200? How can I increase it, and will that help?

1 Answer
在下西门庆
#2 · 2019-06-26 11:38

The number comes from spark.sql.shuffle.partitions, which controls how many partitions are used for shuffles such as grouping. Its default is 200, but it can be increased. Whether that helps depends on your cluster and data.
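A minimal sketch of raising the partition count, assuming a SparkSession named `spark` and an illustrative pair RDD; the exact value (2000 here) is just a placeholder you would tune for your data:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("shuffle-partitions-example")
  .getOrCreate()

// Affects DataFrame/Dataset shuffles (groupBy, joins, etc.); default is 200.
spark.conf.set("spark.sql.shuffle.partitions", "2000")

// For plain RDDs you can also pass the partition count to groupByKey directly:
val pairs = spark.sparkContext
  .parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
  .groupByKey(2000) // explicit number of output partitions for the shuffle
```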

The last two tasks taking very long is a sign of skewed data: those keys contain many more values than the others. Can you use reduceByKey / combineByKey rather than groupByKey (sketched below), or parallelize the problem differently?
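A hedged sketch of that substitution for a per-key sum, assuming the hypothetical RDD `events`; reduceByKey combines values on each partition before the shuffle, so far less data is sent for heavily skewed keys:

```scala
// Hypothetical input: (key, count) pairs.
val events: org.apache.spark.rdd.RDD[(String, Int)] =
  spark.sparkContext.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// groupByKey ships every value for a key across the network, then aggregates:
val groupedSums = events.groupByKey().mapValues(_.sum)

// reduceByKey does map-side combining first, sending only partial sums:
val reducedSums = events.reduceByKey(_ + _)
```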
