SLURM slow for array job

Published 2019-08-28 00:45

Question:

I have a small cluster with nodes A, B, C and D. Each node has 80GB RAM and 32 CPUs. I am using Slurm 17.11.7.

I performed the following benchmark tests:

  • If I run a particular Java command directly in a terminal on node A, I get a result in 2 minutes.
  • If I run the same command as a "single" array job (#SBATCH --array=1-1), I again get a result in 2 minutes (a minimal sketch of this comparison follows the list).
  • If I run the same command with the same parameters as an array job through Slurm, restricted to node A, I get the output in 8 minutes, that is, four times slower. Here, of course, I run 31 other Java commands with different parameters at the same time.
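
In other words, the comparison was essentially the following (the actual Java command is not shown here; java -jar myapp.jar <args> is only a placeholder for it):

# Direct run on node A (placeholder command):
time java -jar myapp.jar <args>

# The same command submitted as a single-element array job:
sbatch --array=1-1 -w <NodeA> --wrap "java -jar myapp.jar <args>"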

I already tried SelectTypeParameters=CR_CPU_Memory and SelectTypeParameters=CR_Core with the same result.

Why is my array job 4 times slower? Thanks for your help!

The header of the array job that I submit looks like this:

#!/bin/bash -l
#SBATCH --array=1-42
#SBATCH --job-name exp
#SBATCH --output logs/output_%A_%a.txt
#SBATCH --error logs/error_%A_%a.txt
#SBATCH --time=20:00
#SBATCH --mem=2048
#SBATCH --cpus-per-task=1
#SBATCH -w <NodeA>

The slurm.conf file looks like this:

ControlMachine=<NodeA>
ControlAddr=<IPNodeA>
MpiDefault=none
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=<test_user_123>
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/affinity

MaxJobCount=100000
MaxArraySize=15000

MinJobAge=300
# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory

# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
ClusterName=Cluster
JobAcctGatherType=jobacct_gather/none
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdLogFile=/var/log/slurmd.log

# COMPUTE NODES
#NodeName=<NameA-D> State=UNKNOWN
NodeName=<NameA> NodeAddr=<IPNodeA> State=UNKNOWN CPUs=32 RealMemory=70363
NodeName=<NameB> NodeAddr=<IPNodeB> State=UNKNOWN CPUs=32 RealMemory=70363
NodeName=<NameC> NodeAddr=<IPNodeC> State=UNKNOWN CPUs=32 RealMemory=70363
NodeName=<NameD> NodeAddr=<IPNodeD> State=UNKNOWN CPUs=32 RealMemory=70363

PartitionName=debug Nodes=<NodeA-D> Default=YES MaxTime=INFINITE State=UP

Answer 1:

If the running time does not depend on the parameter values passed to the Java application, there are two possible explanations:

Either your cgroup configuration does not confine your jobs and your Java code is multithreaded. In that case, if you run only one job, or if you run directly on the node, your single task can use several CPUs in parallel; if you run a job array that saturates the node, each task can only use a single CPU.
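
If the first explanation applies, a possible fix is sketched below (assuming the standard cgroup task plugin and cgroup.conf options of your Slurm version; myapp.jar stands in for the actual program): confine each task to its allocated cores via cgroups, and/or cap the number of processors the JVM may use.

# slurm.conf: add the cgroup task plugin next to affinity
TaskPlugin=task/affinity,task/cgroup

# cgroup.conf: confine each task to its allocated cores and memory
ConstrainCores=yes
ConstrainRAMSpace=yes

# Job script: tell the JVM how many CPUs it may use
# (-XX:ActiveProcessorCount is available in recent JDKs; myapp.jar is a placeholder)
java -XX:ActiveProcessorCount=${SLURM_CPUS_PER_TASK:-1} -jar myapp.jar

Note that changing TaskPlugin requires restarting the Slurm daemons for the new plugin to be loaded.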

Or, your node is configured with hyperthreading. In that case, if you run only one job, or if you run directly on the node, your single task can use a full physical core. If you run a job array that saturates the node, each task must share a physical core with another one.
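
A quick way to check the second case (a sketch; the exact output varies between systems) is to compare what the hardware reports with what slurm.conf declares:

# On the node: "Thread(s) per core: 2" means hyperthreading is enabled
lscpu | grep -E 'Thread\(s\) per core|Core\(s\) per socket|Socket\(s\)'

# Ask slurmd which topology it detects and compare it with the NodeName line
slurmd -C

If ThreadsPerCore turns out to be 2, the 32 "CPUs" are really 16 physical cores with two hardware threads each; declaring Sockets, CoresPerSocket and ThreadsPerCore explicitly in the NodeName lines lets Slurm take that into account when placing tasks.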