I am submitting multiple jobs to a SLURM queue. Each job uses 1 GPU. We have 4 GPUs per node. However, once a job is running, it takes up the entire node, leaving 3 GPUs idle. Is there any way to avoid this, so that I can send multiple jobs to one node, using one GPU each?
My script looks like this:
#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --ntasks-per-node=1
#SBATCH -p ghp-queue
myprog.exe
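For context, I submit several independent copies of this script, roughly like this (job.sh is just a placeholder name for the script above):

# submit four one-GPU jobs to the queue
for i in 1 2 3 4; do
    sbatch job.sh
done
# with the whole-node behaviour described above, only one of these runs per node at a time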
I was also unable to run multiple jobs on different GPUs. What helped was adding OverSubscribe=FORCE to the partition configuration in slurm.conf; a sketch of such a partition entry is shown below. After that, I was able to run four jobs with --gres=gpu:1, and each one took a different GPU (a fifth job is queued, as expected).
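For reference, a minimal sketch of what such a partition entry could look like; the node names, CPU count, and memory below are placeholders for a 4-GPU node, and only the OverSubscribe=FORCE part is the actual change described above:

# slurm.conf (sketch; adjust node names, CPU counts, and memory to your cluster)
GresTypes=gpu
NodeName=gpu[01-02] Gres=gpu:4 CPUs=16 RealMemory=128000 State=UNKNOWN
PartitionName=ghp-queue Nodes=gpu[01-02] Default=YES MaxTime=INFINITE State=UP OverSubscribe=FORCE

Each node also needs a matching gres.conf listing its GPUs, and the change only takes effect after scontrol reconfigure or a restart of slurmctld/slurmd.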