one-to-one dependency between two job arrays in SLURM

Published 2019-04-14 10:28

Question:

The server just switched from CONDOR to SLURM, so I am learning and trying to translate my submission script to SLURM.

My question is the following: I have two job arrays, and the second is dependent on the first. For the time being, I have something like the following:

events1=$(sbatch --job-name=events --array=1-3 --output=z-events-%a.stdout myfirst.sh)
jobid_events1=`echo ${events1} | sed -n -e 's/^.*job //p' `
echo "The job ID of the events is "${jobid_events1}

postevents1=$(sbatch --job-name=postevents --dependency=afterany:${jobid_events1} --array=1-3 mysecond.sh)
jobid_postevents1=`echo ${postevents1} | sed -n -e 's/^.*job //p' `
echo "The job ID post-event calculations is "${jobid_postevents1}
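As an aside, the sed parsing of sbatch's "Submitted batch job NNN" line can be simplified or avoided entirely. A minimal sketch (the sbatch call itself is simulated here with its typical output line, since this snippet is meant to run outside a cluster):

```shell
#!/bin/sh
# Typical sbatch output; in a real script this line would be:
#   out=$(sbatch --job-name=events --array=1-3 myfirst.sh)
out="Submitted batch job 12345"

# Extract the trailing job ID without sed: strip up to the last space.
jobid="${out##* }"
echo "$jobid"            # prints 12345

# Alternatively, --parsable makes sbatch print just the ID, so no parsing
# is needed at all:
#   jobid=$(sbatch --parsable --job-name=events --array=1-3 myfirst.sh)
```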

Here the second job array postevents1 will only start after every element of the first job array events1 has finished. However, what I actually want is for the i-th element of the second job array to depend only on the i-th element of the first job array (in practice, both arrays always have the same size). I know this can be done using a DAG in CONDOR.

I realize that I can manually break up the second job array and create the matches individually. However, since I would have to break up the second job array, it becomes increasingly inconvenient if a third job depends on all the elements of the second job array.

Edit: According to damienfrancois's answer, the keyword aftercorr is what I was looking for. I have a follow-up question.

At first glance, "complete successfully" makes perfect sense. However, if one of the tasks in the first job array does not complete successfully, does one have to manually delete the corresponding task in the second array? If so, what makes this potentially complicated is that any further job depending on partial completion of the tasks of the second job array will hang if any task in the first job array fails (which is pretty common in my practice). In this case, how does one implement "afterany"-style behavior?

Many thanks in advance!

Answer 1:

Since version 16.05, Slurm has the option --dependency=aftercorr:job_id[:jobid...]

A task of this job array can begin execution after the corresponding task ID in the specified job has completed successfully (ran to completion with an exit code of zero).

It does what you need.
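Applied to the submission script from the question, the chaining could look like the sketch below (untested on a live cluster; --parsable makes sbatch print just the job ID, so the sed step is unnecessary):

```shell
# Submit the first array and capture its ID (--parsable prints the bare ID).
jobid_events1=$(sbatch --parsable --job-name=events --array=1-3 \
                --output=z-events-%a.stdout myfirst.sh)

# With aftercorr, task i of the second array waits only for task i of the
# first array to complete successfully, instead of for the whole array.
jobid_postevents1=$(sbatch --parsable --job-name=postevents \
                    --dependency=aftercorr:${jobid_events1} \
                    --array=1-3 mysecond.sh)
```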

It however has the drawback you describe; jobs in the second array will keep waiting indefinitely if the corresponding job in the first array crashes. You have several courses of action, none of which is perfect:

  1. if job crashes can be detected from within the submission script, and crashes are random, you can simply requeue the job with scontrol requeue $SLURM_JOB_ID so that it runs again.

  2. otherwise, you can add, at the end of the jobs in the second array, a piece of Bash code that checks whether any job from the first array is still in the queue, and if not, cancels all remaining jobs in the second array; something like this (untested): [[ $(squeue --noheader --name events | wc -l) == 0 ]] && scancel $SLURM_JOB_ID

  3. finally, a last option is to use a full-fledged workflow system. See this for a short introduction and pointers.
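Option 2 above could be sketched as an epilogue at the end of mysecond.sh (untested outside a Slurm cluster; the job name "events" is taken from the question's first array, and canceling $SLURM_ARRAY_JOB_ID cancels all tasks of the current array):

```shell
# End of mysecond.sh: if no task of the first array ("events") is left in
# the queue, the remaining aftercorr dependencies can never be satisfied,
# so cancel this array's remaining tasks instead of letting them wait forever.
remaining=$(squeue --noheader --name events | wc -l)
if [ "$remaining" -eq 0 ]; then
    scancel "$SLURM_ARRAY_JOB_ID"
fi
```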



Tags: slurm