We're currently encountering an issue where Spark jobs running on YARN are having a number of their containers killed for exceeding memory limits:
```
16/11/18 17:58:52 WARN TaskSetManager: Lost task 53.0 in stage 49.0 (TID 32715, XXXXXXXXXX):
ExecutorLostFailure (executor 23 exited caused by one of the running tasks)
Reason: Container killed by YARN for exceeding memory limits. 12.4 GB of 12 GB physical memory used.
Consider boosting spark.yarn.executor.memoryOverhead.
```
The following arguments are being passed via spark-submit:
```
--executor-memory=6G
--driver-memory=4G
--conf "spark.yarn.executor.memoryOverhead=6G"
```
I am using Spark 2.0.1.
We have increased the memoryOverhead to this value after reading several posts about YARN killing containers (e.g. How to avoid Spark executor from getting lost and yarn container killing it due to memory limit?).
Given my parameters and the log message, it does seem that YARN kills an executor when its container's memory usage exceeds (executor-memory + spark.yarn.executor.memoryOverhead).
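That reading is consistent with the numbers in the warning above:

```
container limit = executor-memory + spark.yarn.executor.memoryOverhead
                = 6 GB + 6 GB
                = 12 GB   <- the "12 GB physical memory" limit that was exceeded
```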
It is not practical to keep increasing this overhead in the hope of eventually finding a value at which these errors stop occurring, and we are seeing the same issue across several different jobs. I would appreciate any suggestions on which parameters to change, what to check, and where to start debugging this. I can provide further configuration details if needed.