I have a front end and two compute nodes.
All of them have the same slurm.conf file, which ends as follows (for details, please see: https://gist.github.com/avatar-lavventura/46b56cd3a29120594773ae1c8bc4b72c):
NodeName=ebloc2 NodeHostName=ebloc NodeAddr=54.227.62.43 CPUs=1
PartitionName=debug Nodes=ebloc2 Default=YES MaxTime=INFINITE State=UP
NodeName=ebloc4 NodeHostName=ebloc NodeAddr=54.236.173.82 CPUs=1
PartitionName=debug Nodes=ebloc4 Default=YES MaxTime=INFINITE State=UP
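I am not sure whether repeating the PartitionName=debug line for each node is valid, or whether both nodes are supposed to be listed in a single partition line. For comparison, the single-partition layout I have in mind would look like this (same hostnames and addresses as above, just rearranged; I have not verified that this is the right approach):

NodeName=ebloc2 NodeHostName=ebloc NodeAddr=54.227.62.43 CPUs=1
NodeName=ebloc4 NodeHostName=ebloc NodeAddr=54.236.173.82 CPUs=1
PartitionName=debug Nodes=ebloc2,ebloc4 Default=YES MaxTime=INFINITE State=UP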
slurmctld only reads the first listed node's information and ignores the second listed node's. When I try to submit a job I receive the following error; slurmctld only tries the first listed node's IP, and when I run sudo slurmd on that first node it works.
Error:
slurmctld: debug2: slurm_connect failed: Connection refused
slurmctld: debug2: Error connecting slurm stream socket at 54.227.62.43:6821: Connection refused
slurmctld: debug2: slurm_connect failed: Connection refused
slurmctld: debug2: Error connecting slurm stream socket at 54.227.62.43:6821: Connection refused
The problem: the compute node I listed first receives jobs, but the compute node I listed second does not. How can I fix this?
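(In case it helps, a quick way to compare what slurmctld has registered for each node is to run scontrol show node ebloc2, scontrol show node ebloc4, and sinfo -N -l on the controller; I can post that output if needed.)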
slurmctld logs: https://gist.github.com/avatar-lavventura/4ec8c1b15e0ada4aa4bd0414e2b1ffb4
Thank you for your valuable time and help.