Whenever I do a `groupByKey` on an RDD, the shuffle gets split into 200 tasks, even if the original RDD is quite large, e.g. 2k partitions and tens of millions of rows.
Moreover, the operation seems to get stuck on the last two tasks which take extremely long to compute.
Why is it 200? How can I increase it, and will it help?
This setting comes from `spark.sql.shuffle.partitions`, which is the number of partitions used for the shuffle when grouping. It defaults to 200 but can be increased. Raising it may help, though that depends on your cluster and data.
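For illustration, here is a minimal sketch of how you might raise the partition count, assuming Scala and a `SparkSession`; the value 2000, the app name, and the sample data are placeholders. Note that `spark.sql.shuffle.partitions` governs Dataset/DataFrame shuffles, while the RDD `groupByKey` overload also accepts an explicit partition count directly:

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch; the app name, the 2000 value and the sample data are placeholders.
val spark = SparkSession.builder()
  .appName("grouping-example")
  .master("local[*]")
  // Partition count used for Dataset/DataFrame shuffles (default 200).
  .config("spark.sql.shuffle.partitions", "2000")
  .getOrCreate()

val sc = spark.sparkContext

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// The RDD API lets you pass the partition count for this particular shuffle explicitly.
val grouped = pairs.groupByKey(numPartitions = 2000)
println(grouped.getNumPartitions)  // 2000
```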
The last two tasks taking very long is most likely due to skewed data: those keys contain many more values than the others. Can you use `reduceByKey`/`combineByKey` rather than `groupByKey`, or parallelize the problem differently?
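For example, if the goal of the `groupByKey` is ultimately an aggregation such as a sum per key, `reduceByKey` combines values within each partition before the shuffle, so a hot key moves far less data across the network. A sketch, continuing from the `SparkContext` above; the data and the per-key sum are illustrative stand-ins for whatever you actually compute:

```scala
// Assumes `sc` is the SparkContext from the previous snippet; data is illustrative.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("a", 4)))

// groupByKey: every value for a key is shuffled to a single task and
// materialized there, so a skewed key becomes one huge, slow task.
val viaGroup = pairs.groupByKey().mapValues(_.sum)

// reduceByKey: values are combined map-side first, so only partial sums
// cross the network and skew hurts much less.
val viaReduce = pairs.reduceByKey(_ + _)

viaReduce.collect().foreach(println)
```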