I am trying to understand what the difference is between SLURM's srun and sbatch commands. I will be happy with a general explanation, rather than specific answers to the following questions, but here are some specific points of confusion that can be a starting point and give an idea of what I'm looking for.
According to the documentation, srun is for submitting jobs, and sbatch is for submitting jobs for later execution, but the practical difference is unclear to me and their behavior seems to be the same. For example, I have a cluster with 2 nodes, each with 2 CPUs. If I execute srun testjob.sh & five times in a row, it will nicely queue up the fifth job until a CPU becomes available, as will executing sbatch testjob.sh.
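A minimal way to reproduce this behavior (a sketch; testjob.sh is an assumed trivial script that just sleeps long enough for the queueing to be observable):

```shell
# Assumed test script: occupies one CPU for a minute.
cat > testjob.sh <<'EOF'
#!/bin/bash
sleep 60
EOF
chmod +x testjob.sh

# Five backgrounded srun calls: on a cluster with 4 CPUs total, the
# fifth call waits in the queue until a CPU frees up, and all output
# is printed to this terminal.
for i in 1 2 3 4 5; do srun ./testjob.sh & done

# Five sbatch calls: the same queueing happens, but each command returns
# immediately and output goes to slurm-<jobid>.out files instead.
for i in 1 2 3 4 5; do sbatch testjob.sh; done
squeue   # on the 2-node, 2-CPU cluster above: four jobs running, one pending
```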
To make the question more concrete, I think a good place to start might be: What are some things that I can do with one that I cannot do with the other, and why?
Many of the arguments to both commands are the same. The ones that seem the most relevant are --ntasks, --nodes, --cpus-per-task, and --ntasks-per-node. How are these related to each other, and how do they differ for srun vs. sbatch?
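For concreteness, here is how those options might combine in a batch script sized for the two-node cluster above. This is a sketch; the option values are illustrative, not prescriptive.

```shell
#!/bin/bash
#SBATCH --nodes=2             # request 2 nodes
#SBATCH --ntasks-per-node=2   # 2 tasks on each node, so 4 tasks in total
#SBATCH --cpus-per-task=1     # each task is allocated 1 CPU
# The above is equivalent to --ntasks=4 with 1 CPU per task; you would
# raise --cpus-per-task (rather than --ntasks) for multithreaded tasks.

srun hostname   # launches the command once per task: 4 copies here
```

Passed directly on the srun command line instead, the same options describe the resources for that single command.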
One particular difference is that srun will raise an error if testjob.sh does not have execute permission (i.e., it was not made executable with chmod +x testjob.sh), whereas sbatch will happily run it. What is happening "under the hood" that causes this to be the case?
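The difference can be demonstrated directly (a sketch, reusing the assumed testjob.sh; the comments reflect the explanation quoted further down, namely that sbatch stores its own copy of the script):

```shell
chmod -x testjob.sh    # strip the execute bit

srun testjob.sh        # fails with a permission error: srun executes the
                       # file as-is on the allocated node, so it needs
                       # the execute bit
sbatch testjob.sh      # accepted: sbatch reads the file and stores a copy,
                       # which Slurm runs through an interpreter when the
                       # job starts
```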
The documentation also mentions that srun is commonly used inside sbatch scripts. This leads to the question: how do they interact with each other, and what is the "canonical" use case for each of them? Specifically, would I ever use srun by itself?
This doesn't actually fully answer the question, but here is some more information I found that may be helpful for someone in the future:
From a related thread I found with a similar question (plus additional information from the SLURM FAQ page):
The documentation says that srun is used to submit a job for execution in real time, while sbatch is used to submit a job script for later execution.
They both accept practically the same set of parameters. The main difference is that srun is interactive and blocking (you get the result in your terminal and you cannot write other commands until it is finished), while sbatch is batch processing and non-blocking (results are written to a file and you can submit other commands right away).

If you use srun in the background with the & sign, then you remove the 'blocking' feature of srun, which becomes interactive but non-blocking. It is still interactive though, meaning that the output will clutter your terminal, and the srun processes are linked to your terminal. If you disconnect, you will lose control over them, or they might be killed (depending on whether they use stdout or not, basically). And they will be killed if the machine to which you connect to submit jobs is rebooted.

If you use sbatch, you submit your job and it is handled by Slurm; you can disconnect, kill your terminal, etc. with no consequence. Your job is no longer linked to a running process.

A feature that is available to sbatch and not to srun is job arrays. As srun can be used within an sbatch script, there is nothing that you cannot do with sbatch.

All the parameters
--ntasks, --nodes, --cpus-per-task, and --ntasks-per-node have the same meaning in both commands. That is true for nearly all parameters, with the notable exception of --exclusive.

srun immediately executes the script on the remote host, while sbatch copies the script into an internal storage and then uploads it on the compute node when the job starts. You can check this by modifying your submission script after it has been submitted; changes will not be taken into account (see this).

You typically use
sbatch to submit a job and srun in the submission script to create job steps, as Slurm calls them. srun is used to launch the processes. If your program is a parallel MPI program, srun takes care of creating all the MPI processes. If not, srun will run your program as many times as specified by the --ntasks option. There are many use cases depending on whether your program is parallel or not, has a long running time or not, is composed of a single executable or not, etc. Unless otherwise specified, srun inherits by default the pertinent options of the sbatch or salloc under which it runs (from here).

Other than for small tests, no. A common use is srun --pty bash to get a shell on a compute node.
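Putting the pieces above together, here is a sketch of the canonical pattern: a job array (an sbatch-only feature) whose script uses srun to create job steps. The program and file names are placeholders.

```shell
#!/bin/bash
#SBATCH --array=0-9    # sbatch-only: submit ten independent array tasks
#SBATCH --ntasks=2     # each array task gets an allocation of 2 tasks

# Each array task runs this script with its own SLURM_ARRAY_TASK_ID.
# Every srun call below becomes a "job step" in Slurm's accounting.
srun ./process "input_${SLURM_ARRAY_TASK_ID}.dat"   # step 0: 2 parallel copies
srun ./collect "output_${SLURM_ARRAY_TASK_ID}.dat"  # step 1: starts after step 0
```

Submitted with sbatch jobscript.sh, accounting tools such as sacct then report each srun invocation as a separate step of each array task.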