Whenever I do a `groupByKey` on an RDD, the shuffle gets split into 200 tasks, even if the original RDD is quite large, e.g. 2k partitions and tens of millions of rows.
Moreover, the operation seems to get stuck on the last two tasks which take extremely long to compute.
Why is it 200? How can I increase it, and will it help?
This setting comes from `spark.sql.shuffle.partitions`, which is the number of partitions used for the shuffle when grouping. It defaults to 200 but can be increased. Raising it may help, though that depends on your cluster and data.
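For illustration, here is a minimal sketch of how you might raise the partition count, assuming Scala and a `SparkSession`; the value 2000, the app name, and the sample data are placeholders. Note that `spark.sql.shuffle.partitions` governs Dataset/DataFrame shuffles, while the RDD `groupByKey` overload also accepts an explicit partition count directly:

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch; the app name, the 2000 value and the sample data are placeholders.
val spark = SparkSession.builder()
  .appName("grouping-example")
  .master("local[*]")
  // Partition count used for Dataset/DataFrame shuffles (default 200).
  .config("spark.sql.shuffle.partitions", "2000")
  .getOrCreate()

val sc = spark.sparkContext

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// The RDD API lets you pass the partition count for this particular shuffle explicitly.
val grouped = pairs.groupByKey(numPartitions = 2000)
println(grouped.getNumPartitions)  // 2000
```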
The last two tasks taking very long is most likely due to skewed data: those keys contain many more values than the others. Can you use `reduceByKey`/`combineByKey` rather than `groupByKey`, or parallelize the problem differently?
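For example, if the goal of the `groupByKey` is ultimately an aggregation such as a sum per key, `reduceByKey` combines values within each partition before the shuffle, so a hot key moves far less data across the network. A sketch, continuing from the `SparkContext` above; the data and the per-key sum are illustrative stand-ins for whatever you actually compute:

```scala
// Assumes `sc` is the SparkContext from the previous snippet; data is illustrative.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("a", 4)))

// groupByKey: every value for a key is shuffled to a single task and
// materialized there, so a skewed key becomes one huge, slow task.
val viaGroup = pairs.groupByKey().mapValues(_.sum)

// reduceByKey: values are combined map-side first, so only partial sums
// cross the network and skew hurts much less.
val viaReduce = pairs.reduceByKey(_ + _)

viaReduce.collect().foreach(println)
```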