slurm: How to connect front-end with compute nodes

Published 2019-05-27 16:35

Question:

I have a front end and two compute nodes.

All of them have the same slurm.conf file, which ends with the following lines (for details, see https://gist.github.com/avatar-lavventura/46b56cd3a29120594773ae1c8bc4b72c):

NodeName=ebloc2 NodeHostName=ebloc NodeAddr=54.227.62.43 CPUs=1
PartitionName=debug Nodes=ebloc2 Default=YES MaxTime=INFINITE State=UP

NodeName=ebloc4 NodeHostName=ebloc NodeAddr=54.236.173.82 CPUs=1
PartitionName=debug Nodes=ebloc4 Default=YES MaxTime=INFINITE State=UP

slurmctld only checks the information for the node listed first and ignores the node listed second. When I try to submit a job I receive the following error. It only handles the first listed node's IP, and when I run sudo slurmd on the first node, it works.

Error:

slurmctld: debug2: slurm_connect failed: Connection refused
slurmctld: debug2: Error connecting slurm stream socket at 54.227.62.43:6821: Connection refused
slurmctld: debug2: slurm_connect failed: Connection refused
slurmctld: debug2: Error connecting slurm stream socket at 54.227.62.43:6821: Connection refused

The problem: the compute node listed first receives jobs, but the compute node listed second does not. How can I fix this?

slurmctld logs: https://gist.github.com/avatar-lavventura/4ec8c1b15e0ada4aa4bd0414e2b1ffb4

Thank you for your valuable time and help.

Answer 1:

In the configuration file, try removing ControlAddr=127.0.0.1, or replacing it with the IP address of ebloc. The address 127.0.0.1 basically means 'myself', and ControlAddr is what slurmd uses to connect to the controller, so on a compute node it points back at the node itself instead of at the front end.

Also remove NodeHostName=localhost NodeAddr=127.0.0.1, for the same reason.

Also make sure that ebloc, ebloc1 and ebloc2 are indeed what hostname -s returns on those machines.
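If you want to script that hostname check, here is a minimal Python sketch. The function names are made up for illustration, and it only approximates hostname -s by cutting the local host name at the first dot. Which name you pass as expected (the NodeName or the NodeHostName from your slurm.conf) is up to you.

```python
import socket
from typing import Optional


def short_hostname(name: str) -> str:
    """Return the part of a host name before the first dot."""
    return name.split(".", 1)[0]


def hostname_matches(expected: str, actual: Optional[str] = None) -> bool:
    """Compare the name expected by slurm.conf with a machine's short host name.

    `actual` defaults to the local host name; pass it explicitly to check
    a value collected from another machine.
    """
    if actual is None:
        actual = socket.gethostname()
    return short_hostname(actual) == expected
```

Run on each machine, hostname_matches("ebloc2") should return True only on the node that slurm.conf calls ebloc2.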

Also make sure no firewall blocks the Slurm ports in any direction between those machines, and that SELinux is disabled or permissive. Make sure slurmd is running, as well as munge.
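To check from the front end whether anything is accepting connections on a node's slurmd port, a plain TCP connection test can help. The sketch below is not part of Slurm; can_connect is a hypothetical helper name, and the address and port in the usage note are the ones from the error log in the question.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused", timeouts and unreachable hosts alike.
        return False
```

For example, can_connect("54.227.62.43", 6821) returning False from the front end would be consistent with the "Connection refused" lines above. It does not tell you whether slurmd is simply not running or a firewall is rejecting the connection, so both are still worth checking on the node itself.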



Answer 2:

You can only have one PartitionName line per partition. Remove the first one and put:

PartitionName=debug Nodes=ebloc2,ebloc4 Default=YES MaxTime=INFINITE State=UP

or use a hostlist expression:

PartitionName=debug Nodes=ebloc[2,4] Default=YES MaxTime=INFINITE State=UP
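If the slurm.conf is long, duplicated partition definitions are easy to miss. Here is a rough Python sketch that lists partition names defined on more than one line. It is a simple line scan written for this answer, not Slurm's own parser. It assumes one definition per line, treats everything after # as a comment, and compares partition names exactly as written.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Matches "PartitionName=debug" at the start of a line, allowing spaces around "=".
PARTITION_RE = re.compile(r"^\s*PartitionName\s*=\s*(\S+)", re.IGNORECASE)


def duplicate_partitions(conf_text: str) -> Dict[str, List[int]]:
    """Map each partition name defined more than once to its 1-based line numbers."""
    seen: Dict[str, List[int]] = defaultdict(list)
    for lineno, line in enumerate(conf_text.splitlines(), start=1):
        line = line.split("#", 1)[0]  # ignore trailing comments
        match = PARTITION_RE.match(line)
        if match:
            seen[match.group(1)].append(lineno)
    return {name: lines for name, lines in seen.items() if len(lines) > 1}
```

Fed the four configuration lines from the question, it reports the debug partition as defined twice. An empty result means no partition name appears on more than one line.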


Tags: slurm