SLURM sbatch multiple parallel calls to executable

Posted 2019-04-28 23:05

Question:

I have an executable that takes multiple options and multiple file inputs in order to run. The executable can be called with a variable number of cores to run.

E.g. executable -a -b -c -file fileA --file fileB ... --file fileZ --cores X

I'm trying to create an sbatch file that lets me make multiple calls to this executable with different inputs. Each call should be allocated to a different node (running in parallel with the rest), using X cores. Parallelization at the core level is taken care of by the executable, while parallelization at the node level is handled by SLURM.

I tried ntasks with multiple sruns, but the first srun was called multiple times.

Another approach was to rename the files and use a SLURM process or node number as the filename before the extension, but that is not really practical.

Any insight on this?

Answer 1:

I always handle this kind of job with a bash script that I submit with sbatch. The easiest approach is a loop in the sbatch script that spawns the job steps for your executable with srun, specifying, for example, the corresponding node name in your partition with -w. You may also want to read the documentation on SLURM array jobs, if that suits you better. Alternatively, you could store all parameter combinations in a file and then loop over them in the script.
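
The array-job and parameter-file ideas can be combined. The following is only a sketch under assumptions: `params.txt` (one line of arguments per call), the helper name `args_for_task`, the array range, and the CPU count are all hypothetical. Each array task is scheduled as its own job, so the calls run in parallel when resources allow.

```shell
#!/bin/bash
#SBATCH --array=1-3          # one array task per line of params.txt (3 lines assumed)
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4    # hypothetical value for X

# args_for_task FILE N: print line N of FILE, i.e. the arguments for array task N
args_for_task() {
    sed -n "${2}p" "$1"
}

# Launch only when running as an array task; params.txt is a hypothetical file name
if [ -n "${SLURM_ARRAY_TASK_ID:-}" ]; then
    args=$(args_for_task params.txt "$SLURM_ARRAY_TASK_ID")
    # $args is left unquoted on purpose so it splits into separate arguments;
    # -c is passed explicitly because some Slurm versions do not hand
    # --cpus-per-task from sbatch down to srun
    srun -c "$SLURM_CPUS_PER_TASK" executable $args --cores "$SLURM_CPUS_PER_TASK"
fi
```

With a line such as `-a -b -c --file fileA --file fileB` in `params.txt`, each array task picks up its own line through `SLURM_ARRAY_TASK_ID`.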

Maybe the following script (which I just threw together) gives you a feeling for what I have in mind; I hope it is what you need. It is untested, so don't just copy and paste it!

#!/bin/bash

parameters=(10 5 2)
node_names=(node1 node2 node3)


# Run one job step per node, each time taking one parameter

for parameter in "${parameters[@]}"
do
    # Script some if/else condition here to specify parameters
    # Assign the job step to the first node left in the list
    node=${node_names[0]}
    JOBNAME="myjob-$node-$parameter"
    # Remove the first node from the list and re-index the array
    node_names=("${node_names[@]:1}")
    # -N sets the number of nodes, -w names the node to use,
    # -p selects the partition, -J sets the job name
    srun -N1 -w "$node" -p somepartition -J "$JOBNAME" executable.sh "$parameter" &
done

You will run into the problem that the sbatch script must be forced to wait for the last job step; otherwise the script exits while the steps are still running. In that case the following additional while loop might help.

# Wait for the last job step to complete
while true
do
    # Use the job states reported by sacct to detect when everything is done
    echo "waiting for last job to finish"
    sleep 10
    # sacct shows your jobs; -s R,PD keeps only running and pending ones
    sacct -s R,PD | grep "myjob"   # your job name indicator
    # Check the exit status of grep (1 if nothing was found)
    if [ "$?" == "1" ]
    then
        echo "found no running jobs anymore"
        sacct -s R | grep "myjob"
        echo "stopping loop"
        break
    fi
done
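
As a possible alternative to polling sacct, the shell builtin `wait` can block on the PIDs of the background srun calls and also report their exit status. This is a sketch, not tested on a cluster; `wait_for_steps` is a hypothetical helper name.

```shell
#!/bin/bash

# wait_for_steps PID...: block on each background PID and count the failures
wait_for_steps() {
    failed=0
    for pid in "$@"; do
        # "wait PID" returns the exit status of that background process
        wait "$pid" || failed=$((failed + 1))
    done
    echo "$failed step(s) failed"
    # Succeed only if every step succeeded
    [ "$failed" -eq 0 ]
}
```

In the loop, the PID of each step would be recorded right after the srun line with `pids="$pids $!"`, and the script would end with `wait_for_steps $pids`.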


Answer 2:

I managed to find one possible solution, so I'm posting it for reference:

I declared as many tasks as calls to the executable, as well as nodes and the desired number of cpus per call.

Then I used a separate srun for each call, declaring the number of nodes and tasks for each. All the sruns are run in the background with ampersands (&):

srun -n 1 -N 1 --exclusive executable -a1 -b1 -c1 -file fileA1 --file fileB1 ... --file fileZ1 --cores X1 &

srun -n 1 -N 1 --exclusive executable -a2 -b2 -c2 -file fileA2 --file fileB2 ... --file fileZ2 --cores X2 &

....

srun -n 1 -N 1 --exclusive executable -aN -bN -cN -file fileAN --file fileBN ... --file fileZN --cores XN

--Edit: After some tests (as I mentioned in a comment below), if the process of the last srun ends before the rest, it seems to end the whole job, leaving the rest unfinished.

--edited based on the comment by Carles Fenoy
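
One commonly used way to keep the job from ending early is to put every srun in the background, including the last one, and finish the script with `wait`. The following is a minimal sketch under assumptions: three calls, 4 CPUs per call, and made-up argument strings; `launch_all` is a hypothetical helper name.

```shell
#!/bin/bash
#SBATCH --nodes=3            # one node per call (3 calls assumed)
#SBATCH --ntasks=3           # one task per call
#SBATCH --cpus-per-task=4    # hypothetical value for X

# launch_all PREFIX ARGS...: start one background step per argument string,
# then block until all of them have finished
launch_all() {
    prefix=$1
    shift
    for args in "$@"; do
        # Unquoted on purpose so prefix and args split into separate words
        $prefix $args &
    done
    # Without this, the script (and with it the job) could end while steps still run
    wait
}

# Launch only inside a SLURM allocation; the argument strings are illustrative
if [ -n "${SLURM_JOB_ID:-}" ]; then
    launch_all "srun -n 1 -N 1 -c $SLURM_CPUS_PER_TASK --exclusive executable" \
        "-a1 --file fileA1 --cores 4" \
        "-a2 --file fileA2 --cores 4" \
        "-a3 --file fileA3 --cores 4"
fi
```

Because `wait` without arguments returns only after all background children have exited, the order in which the steps finish no longer matters.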



Answer 3:

Write a bash script that generates multiple xyz.slurm files and submits each of them with sbatch. The following script uses a nested for loop to create 8 files. It then iterates over them, replacing a placeholder string with each mode, and submits them. You might need to modify the script to suit your needs.

#!/usr/bin/env bash
# Path where the slurm files are created
slurmpath=~/Desktop/slurms
rm -rf "$slurmpath"
mkdir -p "$slurmpath/sbatchop"
mkdir -p /exports/home/schatterjee/reports
echo "Folder /slurms and /reports created"

declare -a threads=("1" "2" "4" "8")
declare -a chunks=("1000" "32000")
declare -a modes=("server" "client")

# Loop through the arrays above
for i in "${threads[@]}"
do
    for j in "${chunks[@]}"
    do
# The heredoc below is the content of each slurm file
cat <<EOF >"$slurmpath/net-$i-$j.slurm"
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --output=$slurmpath/sbatchop/net-$i-$j.out
#SBATCH --wait-all-nodes=1
echo \$SLURM_JOB_NODELIST

cd /exports/home/schatterjee/cs553-pa1

srun ./MyNETBench-TCP placeholder1 $i $j
EOF
        # Now schedule them
        for m in "${modes[@]}"
        do
            for value in {1..5}
            do
                # Replace placeholder1 with the value of m on the fly and pipe the
                # result to sbatch, which reads the script from stdin. The file on
                # disk keeps its placeholder, so every mode is substituted; an
                # in-place sed would leave nothing to replace after the first mode.
                sed -e 's/placeholder1/'"$m"'/g' "$slurmpath/net-$i-$j.slurm" | sbatch
            done
        done
    done
done


Tags: slurm