I have a small cluster with nodes A, B, C and D. Each node has 80GB RAM and 32 CPUs. I am using Slurm 17.11.7.
I performed the following benchmark tests:
- If I run a particular Java command directly on terminal on node A, I get an result in 2minutes.
- If I run the same command with an "single" array job (#SBATCH --array=1-1), I get an result again in 2minutes.
- If I ran the same command with same parameters with an array job on slurm only on node A, I get the output in 8mininutes, that is, it is four times slower. Here, I, of course, run 31 other Java commands with different parameters at the same time.
I already tried SelectTypeParameters=CR_CPU_Memory and SelectTypeParameters=CR_Core with the same result.
Why is my array job 4 times slower? Thanks for your help!
The header of my array job, which I submit, looks like this:
#!/bin/bash -l
#SBATCH --array=1-42
#SBATCH --job-name exp
#SBATCH --output logs/output_%A_%a.txt
#SBATCH --error logs/error_%A_%a.txt
#SBATCH --time=20:00
#SBATCH --mem=2048
#SBATCH --cpus-per-task=1
#SBATCH -w <NodeA>
The slurm.conf file looks like:
ControlMachine=<NodeA>
ControlAddr=<IPNodeA>
MpiDefault=none
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=<test_user_123>
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/affinity
MaxJobCount=100000
MaxArraySize=15000
MinJobAge=300
# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
ClusterName=Cluster
JobAcctGatherType=jobacct_gather/none
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdLogFile=/var/log/slurmd.log
# COMPUTE NODES
#NodeName=NameA-D> State=UNKNOWN
NodeName=<NameA> NodeAddr=<IPNodeA> State=UNKNOWN CPUs=32 RealMemory=70363
NodeName=<NameB> NodeAddr=<IPNodeB> State=UNKNOWN CPUs=32 RealMemory=70363
NodeName=<NameC> NodeAddr=<IPNodeC> State=UNKNOWN CPUs=32 RealMemory=70363
NodeName=<NameD> NodeAddr=<IPNodeD> State=UNKNOWN CPUs=32 RealMemory=70363
PartitionName=debug Nodes=<NodeA-D> Default=YES MaxTime=INFINITE State=UP