I am running a Spark job with spark.task.maxFailures
set to 1, and according to the official documentation:
spark.task.maxFailures
Number of individual task failures before giving up on the job. Should be greater than or equal to 1. Number of allowed retries = this value - 1.
So my job should fail as soon as a task fails... However, it is retried a second time before giving up. Am I missing something? I have checked the property value at runtime just in case, and it is correctly set to 1. In my case, it fails in the last step: the first attempt creates the output directory, and the second attempt always fails because the output directory already exists, which is not really helpful.
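For reference, this is roughly how I set and verify the property (app name and details are simplified placeholders):

```scala
import org.apache.spark.sql.SparkSession

// Simplified reproduction of my setup
val spark = SparkSession.builder()
  .appName("max-failures-test")               // placeholder name
  .config("spark.task.maxFailures", "1")
  .getOrCreate()

// The value seen at runtime is indeed 1
println(spark.conf.get("spark.task.maxFailures"))  // prints: 1
```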
Is there some kind of bug in this property, or is the documentation wrong?
spark.task.maxFailures controls how many times an individual task may fail before Spark gives up, but what you are describing sounds like the whole job (application) failing and being retried.
If you're running this with YARN, the job itself could be resubmitted multiple times; see yarn.resourcemanager.am.max-attempts. If so, you could turn that setting down to 1.
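If that turns out to be the cause and you would rather control it from the Spark side, a minimal sketch (assuming YARN mode) is to cap both knobs explicitly; spark.yarn.maxAppAttempts limits application-level resubmissions and is itself capped by the cluster's yarn.resourcemanager.am.max-attempts:

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: disable both task-level and application-level retries on YARN
val spark = SparkSession.builder()
  .appName("no-retries")                      // hypothetical app name
  .config("spark.task.maxFailures", "1")      // fail the stage on the first task failure
  .config("spark.yarn.maxAppAttempts", "1")   // do not resubmit the application on failure
  .getOrCreate()
```

With a single application attempt, the output directory created by a failed run will no longer collide with a second attempt.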