We're currently encountering an issue where Spark jobs running on YARN are having a number of their containers killed for exceeding memory limits:
```
16/11/18 17:58:52 WARN TaskSetManager: Lost task 53.0 in stage 49.0 (TID 32715, XXXXXXXXXX):
ExecutorLostFailure (executor 23 exited caused by one of the running tasks)
Reason: Container killed by YARN for exceeding memory limits. 12.4 GB of 12 GB physical memory used.
Consider boosting spark.yarn.executor.memoryOverhead.
```
The following arguments are being passed via spark-submit:
```
--executor-memory=6G
--driver-memory=4G
--conf "spark.yarn.executor.memoryOverhead=6G"
```
I am using Spark 2.0.1.
We have increased the memoryOverhead to this value after reading several posts about YARN killing containers (e.g. How to avoid Spark executor from getting lost and yarn container killing it due to memory limit?).
Given my parameters and the log message, it does seem that YARN kills an executor when its container's memory usage exceeds (executor-memory + spark.yarn.executor.memoryOverhead).
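That reading is consistent with the numbers in the warning above:

```
container limit = executor-memory + spark.yarn.executor.memoryOverhead
                = 6 GB + 6 GB
                = 12 GB   <- the "12 GB physical memory" limit that was exceeded
```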
It is not practical to keep increasing this overhead in the hope of eventually finding a value at which these errors stop occurring, and we are seeing the same issue across several different jobs. I would appreciate any suggestions on which parameters to change, what to check, and where to start debugging this. I can provide further configuration details if needed.