Right now I'm submitting jobs on a cluster with qsub, but they always seem to run on a single node. I currently run them like this:
#PBS -l walltime=10
#PBS -l nodes=4:gpus=2
#PBS -r n
#PBS -N test
range_0_total=$(seq 0 $(expr $total - 1))
for i in $range_0_total
do
$PATH_TO_JOB_EXEC/job_executable &
done
wait
I would be incredibly grateful if you could tell me if I'm doing something wrong, or if it's just that my test tasks are too small.
With your approach, you would need your for loop to iterate over all of the entries in the file pointed to by $PBS_NODEFILE, and then inside the loop run "ssh $i $PATH_TO_JOB_EXEC/job_executable &".
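A minimal sketch of that loop, assuming $PATH_TO_JOB_EXEC is set as in your script (the /dev/null fallback is only there so the loop does nothing gracefully when run outside a PBS job):

```shell
# Fall back to an empty file when not running under PBS.
NODEFILE="${PBS_NODEFILE:-/dev/null}"

# $PBS_NODEFILE lists one line per allocated core; `sort -u` keeps one
# entry per node, so this starts one copy per node rather than per core.
for node in $(sort -u "$NODEFILE"); do
    ssh "$node" "$PATH_TO_JOB_EXEC/job_executable" &
done
wait   # block until every remote copy has finished
```

Drop the `sort -u` if you actually want one copy per allocated core, since the nodefile repeats each hostname once per core.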
The other, easier way to do this would be to replace the for loop and wait with:
pbsdsh $PATH_TO_JOB_EXEC/job_executable
This would run a copy of your program on each core assigned to your job. If you need to modify this behavior, take a look at the options available in the pbsdsh man page.
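For instance, two options from the Torque pbsdsh man page that change where the copies land (this fragment only makes sense inside a PBS job script, so it is a sketch, not something to run standalone):

```shell
pbsdsh $PATH_TO_JOB_EXEC/job_executable       # default: one copy per allocated core
pbsdsh -u $PATH_TO_JOB_EXEC/job_executable    # -u: one copy per unique node
pbsdsh -n 0 $PATH_TO_JOB_EXEC/job_executable  # -n 0: only on the first allocated node
```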