My goal is to build a pipeline using Slurm dependencies and handle the case where a Slurm job crashes.
Based on the following answer and section 29 of the guide, it is recommended to use scontrol requeue $jobID, which re-queues the already cancelled job.
if job crashes can be detected from within the submission script, and crashes are random, you can simply requeue the job with scontrol requeue $SLURM_JOB_ID so that it runs again.
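The quoted advice could be sketched as a small wrapper inside the submission script. This is only a minimal sketch under stated assumptions: run_or_requeue is a hypothetical helper name, and the SCONTROL variable is an assumption added so the command can be swapped out for a dry run where Slurm is not installed. The only Slurm pieces used are scontrol requeue and the SLURM_JOB_ID variable mentioned in the quote.

```shell
# Hypothetical helper: run a command and requeue the current job if it fails.
# SCONTROL defaults to the real scontrol binary; override it only for dry runs.
run_or_requeue() {
    local scontrol_cmd="${SCONTROL:-scontrol}"
    "$@"
    local status=$?
    if [ "$status" -ne 0 ]; then
        # SLURM_JOB_ID is set by Slurm inside a running batch job.
        "$scontrol_cmd" requeue "$SLURM_JOB_ID"
    fi
    return "$status"
}

# Usage inside a batch script (./my_program is a placeholder):
#   run_or_requeue ./my_program
```

A requeued job starts again from the beginning, so this only makes sense when crashes are random, as the quote says.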
After I re-queue a cancelled job, its dependent job remains in the DependencyNeverSatisfied state, and nothing happens even after the re-queued job completes. Is there any way to update the dependent job's state once the cancelled job has been re-queued?
Example:
$ sbatch run.sh
Submitted batch job 1
$ sbatch --dependency=aftercorr:1 run.sh
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
89 debug run.sh alper PD 0:00 1 (Dependency)
88 debug run.sh alper R 0:23 1 ebloc1
$ scancel 1
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
89 debug run.sh alper PD 0:00 1 (DependencyNeverSatisfied)
$ scontrol requeue 1
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
89 debug run.sh alper PD 0:00 1 (DependencyNeverSatisfied)
88 debug run.sh alper R 0:00 1 ebloc1
# After the running job completes, the dependent job still remains in the DependencyNeverSatisfied state:
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
89 debug run.sh alper PD 0:00 1 (DependencyNeverSatisfied)
After I re-queue a cancelled job, its dependent job remains in the DependencyNeverSatisfied state, and nothing happens even after the re-queued job completes. Is there any way to update the dependent job's state once the cancelled job has been re-queued?
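Yes, it's quite simple. Reset the dependency with scontrol:
scontrol update jobid=[dependent job id] dependency=after:[requeued job id]
Note that the after type only waits for the requeued job to start (or be cancelled). If the dependent job should wait for a successful exit, use the dependency type you originally submitted with instead, for example afterok or aftercorr.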
I've done this as an example with Slurm version 17.11:
$ sbatch --begin=now+60 --wrap="exit 1"
Submitted batch job 540912
$ sbatch --dependency=afterok:540912 --wrap=hostname
Submitted batch job 540913
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
540912 debug wrap marshall PD 0:00 1 (BeginTime)
540913 debug wrap marshall PD 0:00 1 (Dependency)
$ scancel 540912
$ scontrol requeue 540912
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
540912 debug wrap marshall PD 0:00 1 (BeginTime)
540913 debug wrap marshall PD 0:00 1 (DependencyNeverSatisfied)
At this point, I've replicated your situation. Job 540912 has been requeued, and job 540913 has the reason "DependencyNeverSatisfied".
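With more than a couple of jobs in the queue, the stuck dependents can be picked out by filtering squeue output on the reason field. The following is a rough sketch: stuck_dependents is a hypothetical helper name, and it assumes input lines of the form "JOBID REASON", which squeue's %i and %r format fields should produce.

```shell
# Hypothetical filter: read "JOBID REASON" lines on stdin and print the IDs
# of jobs whose reason is DependencyNeverSatisfied.
stuck_dependents() {
    awk '$2 == "DependencyNeverSatisfied" { print $1 }'
}

# Usage on a cluster (-h suppresses the header line):
#   squeue -h -u "$USER" -o "%i %r" | stuck_dependents
```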
Now you can fix it by issuing scontrol update:
$ scontrol update jobid=540913 dependency=after:540912
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
540912 debug wrap marshall PD 0:00 1 (BeginTime)
540913 debug wrap marshall PD 0:00 1 (Dependency)
The state is fixed! Once the job runs, the dependent job also runs:
$ scontrol update jobid=540912 starttime=now
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
540912 debug wrap marshall CG 0:00 1 v1
540913 debug wrap marshall PD 0:00 1 (Dependency)
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
squeue's output is empty because the job already completed.
You can see the jobs after they've completed with sacct:
$ sacct -j 540912,540913
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
540912             wrap      debug       test          2     FAILED      1:0
540912.batch      batch                  test          2     FAILED      1:0
540912.exte+     extern                  test          2  COMPLETED      0:0
540913             wrap      debug       test          2  COMPLETED      0:0
540913.batch      batch                  test          2  COMPLETED      0:0
540913.exte+     extern                  test          2  COMPLETED      0:0
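The requeue and the dependency reset shown above can be combined into one step so the dependent job is never left behind. This is a sketch under assumptions, not a definitive implementation: requeue_and_relink is a hypothetical helper name, the SCONTROL variable is an assumption that allows a dry run without Slurm, and the dependency type defaults to after only because that is what the example above uses.

```shell
# Hypothetical helper: requeue a job and re-attach one dependent job to it.
# Arguments: <requeued job id> <dependent job id> [dependency type, default "after"]
requeue_and_relink() {
    local scontrol_cmd="${SCONTROL:-scontrol}"
    local parent="$1" child="$2" dep_type="${3:-after}"
    # Stop here if the requeue itself is rejected.
    "$scontrol_cmd" requeue "$parent" || return 1
    # Reset the dependency so the reason goes back to Dependency.
    "$scontrol_cmd" update "jobid=$child" "dependency=$dep_type:$parent"
}

# Usage, matching the job IDs in the example:
#   requeue_and_relink 540912 540913
#   requeue_and_relink 540912 540913 afterok   # wait for a successful exit instead
```

Passing afterok as the third argument keeps the dependent job from running when the requeued job fails again, which after alone does not do.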