Question:

I have a front end and two compute nodes.

All of them have the same slurm.conf file, which ends with the following (for details, see https://gist.github.com/avatar-lavventura/46b56cd3a29120594773ae1c8bc4b72c):

NodeName=ebloc2 NodeHostName=ebloc NodeAddr=54.227.62.43 CPUs=1
PartitionName=debug Nodes=ebloc2 Default=YES MaxTime=INFINITE State=UP

NodeName=ebloc4 NodeHostName=ebloc NodeAddr=54.236.173.82 CPUs=1
PartitionName=debug Nodes=ebloc4 Default=YES MaxTime=INFINITE State=UP

slurmctld only checks the information for the node listed first and ignores the node listed second. When I try to submit a job I receive the error below. It only handles the first listed node's IP, and when I run sudo slurmd on the first node it works.

Error:

slurmctld: debug2: slurm_connect failed: Connection refused
slurmctld: debug2: Error connecting slurm stream socket at 54.227.62.43:6821: Connection refused
slurmctld: debug2: slurm_connect failed: Connection refused
slurmctld: debug2: Error connecting slurm stream socket at 54.227.62.43:6821: Connection refused

The problem: the compute node listed first receives jobs, but the compute node listed second does not. How can I fix it?

slurmctld logs: https://gist.github.com/avatar-lavventura/4ec8c1b15e0ada4aa4bd0414e2b1ffb4

Thank you for your valuable time and help.

Answer 1:

In the configuration file, try removing ControlAddr=127.0.0.1, or replacing it with the IP address of ebloc. The address 127.0.0.1 basically means 'myself', and ControlAddr is what slurmd uses to connect to the controller.

Also remove NodeHostName=localhost NodeAddr=127.0.0.1, for the same reason.

Make sure that ebloc, ebloc1 and ebloc2 are indeed what hostname -s returns on those machines.
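The hostname check above can be scripted on each machine. This is a minimal sketch, not a Slurm tool: short_hostname and matches_config are hypothetical helper names, and the expected value is assumed to be whatever NodeHostName (or NodeName, when NodeHostName is not set) says in your slurm.conf.

```python
import socket


def short_hostname() -> str:
    """Mirror `hostname -s`: the host name up to the first dot."""
    return socket.gethostname().split(".", 1)[0]


def matches_config(expected: str) -> bool:
    """Compare this machine's short host name with the name in slurm.conf."""
    return short_hostname() == expected


if __name__ == "__main__":
    # "ebloc" is the NodeHostName used in the question; replace it with
    # the value configured for the machine you run this on.
    expected = "ebloc"
    print(f"hostname -s gives {short_hostname()!r}, "
          f"slurm.conf expects {expected!r}: "
          f"{'OK' if matches_config(expected) else 'MISMATCH'}")
```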

Also make sure no firewall blocks the Slurm ports in any direction between those machines, and that SELinux is disabled or permissive. Make sure slurmd is running, as well as munge.
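A "Connection refused" like the one in the log usually means nothing is listening on that port, or a firewall is rejecting the connection. One way to test this from the front end is a plain TCP connect, sketched below. The addresses are the NodeAddr values from the question and 6821 is the port shown in the slurmctld log; port_is_open is a hypothetical helper, and a successful connect only shows that something accepts connections there, not that slurmd and munge are configured correctly.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # Values taken from the question; adjust them to your own slurm.conf.
    nodes = {"ebloc2": "54.227.62.43", "ebloc4": "54.236.173.82"}
    slurmd_port = 6821
    for name, addr in nodes.items():
        state = "reachable" if port_is_open(addr, slurmd_port) else "NOT reachable"
        print(f"{name} ({addr}:{slurmd_port}): {state}")
```

Running the same check from each compute node against the controller's address and port covers the other direction.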

Answer 2:

You can only have one PartitionName line per partition. Remove the first one and change the second to:

PartitionName=debug Nodes=ebloc2,ebloc4 Default=YES MaxTime=INFINITE State=UP

or use a hostlist expression (Slurm's bracket syntax, not a regular expression):

PartitionName=debug Nodes=ebloc[2,4] Default=YES MaxTime=INFINITE State=UP
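To spot this kind of mistake in a longer slurm.conf, a small script can list partitions that are defined on more than one line. This is only a sketch: duplicated_partitions is a hypothetical helper, not part of Slurm, and it assumes one definition per line with no line continuations. PartitionName=DEFAULT is skipped because that record may legitimately appear several times.

```python
import re
from collections import defaultdict


def duplicated_partitions(conf_text: str) -> dict:
    """Map each partition defined more than once to its 1-based line numbers."""
    seen = defaultdict(list)
    for lineno, line in enumerate(conf_text.splitlines(), start=1):
        line = line.split("#", 1)[0].strip()  # drop comments
        m = re.match(r"PartitionName\s*=\s*(\S+)", line, re.IGNORECASE)
        if m and m.group(1).upper() != "DEFAULT":
            seen[m.group(1)].append(lineno)
    return {name: lines for name, lines in seen.items() if len(lines) > 1}


# The tail of the slurm.conf from the question.
QUESTION_CONF = """\
NodeName=ebloc2 NodeHostName=ebloc NodeAddr=54.227.62.43 CPUs=1
PartitionName=debug Nodes=ebloc2 Default=YES MaxTime=INFINITE State=UP

NodeName=ebloc4 NodeHostName=ebloc NodeAddr=54.236.173.82 CPUs=1
PartitionName=debug Nodes=ebloc4 Default=YES MaxTime=INFINITE State=UP
"""

if __name__ == "__main__":
    for name, lines in duplicated_partitions(QUESTION_CONF).items():
        print(f"partition {name!r} is defined on lines {lines}")
```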

slurm: How to connect front-end with compute nodes
