Running multiple worker daemons SLURM

Posted 2019-04-14 17:05

Question:

I want to run multiple worker daemons on a single machine. According to damienfrancois's answer to "What is the minimum number of computers for a slurm cluster", this can be done. The problem is that I am currently able to run only one worker daemon per machine. For example:

When I run

sudo slurmd -N linux1 -cDvv
sudo slurmd -N linux2 -cDvv

linux1 goes down when I run linux2. Is it possible to run multiple worker daemons on one machine? Here is my slurm.conf file

Answer 1:

As your intention seems to be just to test the behavior of Slurm, I would recommend front-end mode, in which you can create dummy compute nodes on the same machine.

The Slurm FAQ has more details, but basically you must build your installation with this mode enabled:

./configure --enable-front-end  

Then define the nodes in slurm.conf:

NodeName=test[1-100] NodeHostName=localhost

That FAQ also explains how to launch more than one real daemon on the same node by changing the ports, but for my testing purposes that was not necessary.

Good luck!



Answer 2:

I had the same issue and resolved it by changing the log, PID, and spool paths, as described in the Slurm documentation on multiple slurmd support. In your slurm.conf, for example,

SlurmdLogFile=/var/log/slurm/slurmd.log
SlurmdPidFile=/var/run/slurmd.pid
SlurmdSpoolDir=/var/spool/slurmd

must become

SlurmdLogFile=/var/log/slurm/slurmd.%n.log
SlurmdPidFile=/var/run/slurmd.%n.pid
SlurmdSpoolDir=/var/spool/slurmd.%n

Now you can launch multiple slurmd daemons.
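
To see why the `%n` placeholder removes the collision, here is a minimal sketch that mimics the substitution for the two node names from the question. slurmd performs this expansion internally; `expand_node_path` is a hypothetical helper written only for illustration, not a Slurm command.

```shell
#!/bin/sh
# Mimic how %n in a slurm.conf path is replaced by the node name.
# expand_node_path is a hypothetical helper, not part of Slurm.
# It assumes node names contain no "/" or "&" characters.
expand_node_path() {
    # $1 = path template containing %n, $2 = node name
    printf '%s\n' "$1" | sed "s/%n/$2/g"
}

# Print the per-daemon paths for the two nodes from the question.
for node in linux1 linux2; do
    expand_node_path '/var/log/slurm/slurmd.%n.log' "$node"
    expand_node_path '/var/run/slurmd.%n.pid' "$node"
    expand_node_path '/var/spool/slurmd.%n' "$node"
done
```

Each daemon ends up with its own log file, PID file, and spool directory, so starting the second slurmd with `-N linux2` no longer touches the files of the first one.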

Note: I tried your slurm.conf, and I think some parameters are missing: you need to define two NodeName entries instead of one and specify which Port each node uses. This works for me:

# COMPUTE NODES
NodeName=linux[1-10] NodeHostname=linux0 Port=17004 CPUs=1 State=UNKNOWN
NodeName=linux[11-19] NodeHostname=linux0 Port=17005 CPUs=1 State=UNKNOWN
# PARTITIONS
PartitionName=main Nodes=linux1  Default=YES MaxTime=INFINITE State=UP
PartitionName=dev Nodes=linux11  Default=YES MaxTime=INFINITE State=UP
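
The configuration above gives each node range a single port, which is enough for the two daemons used in the partitions (linux1 and linux11). If you want to start several daemons from the same range on one host, each slurmd will most likely need its own port. A minimal sketch, assuming that requirement, which prints one NodeName line per daemon; `gen_node_lines` is a hypothetical helper, and the host name, base port, and `CPUs=1` are simply carried over from the example above.

```shell
#!/bin/sh
# Print one NodeName line per local slurmd, each with a distinct port.
# gen_node_lines is a hypothetical helper, not part of Slurm.
gen_node_lines() {
    # $1 = NodeHostname, $2 = number of daemons, $3 = first port
    host=$1
    count=$2
    base_port=$3
    i=1
    while [ "$i" -le "$count" ]; do
        printf 'NodeName=linux%d NodeHostname=%s Port=%d CPUs=1 State=UNKNOWN\n' \
            "$i" "$host" "$((base_port + i - 1))"
        i=$((i + 1))
    done
}

# Two daemons on host linux0, using ports 17004 and 17005.
gen_node_lines linux0 2 17004
```

The printed lines can be pasted into the COMPUTE NODES section of slurm.conf; the partition definitions then need to list the same node names.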