We run a Spark cluster in yarn-client mode to compute several business jobs, but sometimes a task runs for a very long time (here, 1.7 h). We have not set a timeout, but I assumed Spark's default task timeout would not be this long. Can anyone give me an idea of how to work around this issue?
The trick here is to log in directly to the worker node and kill the process. Usually you can find the offending process with a combination of `top`, `ps`, and `grep`, then just do a `kill <pid>`. There is no way for Spark to kill its own tasks if they are taking too long.
But I figured out a way to handle this using speculation.

Note: a low `spark.speculation.quantile` means speculation will kick in starting from your first tasks, so use it with caution. I am using it because some of my jobs slow down over time due to GC. You should know when to use this; it is not a silver bullet.

Some relevant links: http://apache-spark-user-list.1001560.n3.nabble.com/Does-Spark-always-wait-for-stragglers-to-finish-running-td14298.html and http://mail-archives.us.apache.org/mod_mbox/spark-user/201506.mbox/%3CCAPmMX=rOVQf7JtDu0uwnp1xNYNyz4xPgXYayKex42AZ_9Pvjug@mail.gmail.com%3E
Update
I found a fix for my issue (might not work for everyone). I had a bunch of simulations running per task, so I added timeout around the run. If a simulation is taking longer (due to a data skew for that specific run), it will timeout.
Make sure you handle an interrupt inside the simulator's main loop, like:
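The original code block is missing here; a minimal Python sketch of the idea follows. The names `run_simulation` and `run_with_timeout` are hypothetical stand-ins: the simulation loop checks a stop flag between iterations, and a wrapper cancels it when a per-run timeout expires.

```python
import threading

def run_simulation(stop_requested, work_items):
    """Toy simulation loop; checks the stop flag between iterations."""
    processed = []
    for item in work_items:
        if stop_requested.is_set():
            # Interrupt requested: bail out cleanly instead of running forever.
            break
        processed.append(item * 2)  # stand-in for one simulation step
    return processed

def run_with_timeout(work_items, timeout_s):
    """Run the simulation in a worker thread; cancel it after timeout_s."""
    stop_requested = threading.Event()
    result = {}

    def target():
        result["value"] = run_simulation(stop_requested, work_items)

    worker = threading.Thread(target=target)
    worker.start()
    worker.join(timeout_s)
    if worker.is_alive():
        stop_requested.set()  # signal the loop to stop at the next check
        worker.join()
    return result.get("value")
```

The key design point is that the cancellation is cooperative: the loop must poll the flag itself, since (as with a JVM thread interrupt) nothing can forcibly stop a step that never checks.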