I want to emulate SLURM on Ubuntu 16.04. I don't need serious resource management, I just want to test some simple examples. I cannot install SLURM in the usual way, and I am wondering if there are other options. Other things I have tried:
A Docker image. Unfortunately, docker pull agaveapi/slurm; docker run agaveapi/slurm
gives me errors:
/usr/lib/python2.6/site-packages/supervisor/options.py:295: UserWarning: Supervisord is running as root and it is searching for its configuration file in default locations (including its current working directory); you probably want to specify a "-c" argument specifying an absolute path to a configuration file for improved security.
'Supervisord is running as root and it is searching '
2017-10-29 15:27:45,436 CRIT Supervisor running as root (no user in config file)
2017-10-29 15:27:45,437 INFO supervisord started with pid 1
2017-10-29 15:27:46,439 INFO spawned: 'slurmd' with pid 9
2017-10-29 15:27:46,441 INFO spawned: 'sshd' with pid 10
2017-10-29 15:27:46,443 INFO spawned: 'munge' with pid 11
2017-10-29 15:27:46,443 INFO spawned: 'slurmctld' with pid 12
2017-10-29 15:27:46,452 INFO exited: munge (exit status 0; not expected)
2017-10-29 15:27:46,452 CRIT reaped unknown pid 13)
2017-10-29 15:27:46,530 INFO gave up: munge entered FATAL state, too many start retries too quickly
2017-10-29 15:27:46,531 INFO exited: slurmd (exit status 1; not expected)
2017-10-29 15:27:46,535 INFO gave up: slurmd entered FATAL state, too many start retries too quickly
2017-10-29 15:27:46,536 INFO exited: slurmctld (exit status 0; not expected)
2017-10-29 15:27:47,537 INFO success: sshd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2017-10-29 15:27:47,537 INFO gave up: slurmctld entered FATAL state, too many start retries too quickly
This guide to start a SLURM VM via Vagrant. I tried, but copying over my munge
key timed out.
sudo scp /etc/munge/munge.key vagrant@server:/home/vagrant/
ssh: connect to host server port 22: Connection timed out
lost connection
So ... we have an existing cluster here but it runs an older Ubuntu version which does not mesh well with my workstation running 17.04.
So on my workstation, I just made sure I slurmctld
(backend) and slurmd
installed, and then set up a trivial slurm.conf
with
ControlMachine=mybox
# ...
NodeName=DEFAULT CPUs=4 RealMemory=4000 TmpDisk=50000 State=UNKNOWN
NodeName=mybox CPUs=4 RealMemory=16000
after which I restarted slurmcltd
and then slurmd
. Now all is fine:
root@mybox:/etc/slurm-llnl$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
demo up infinite 1 idle mybox
root@mybox:/etc/slurm-llnl$
This is a degenerate setup, our real one has a mix of dev and prod machine and appropriate partitions. But this should answer your "can backend really be client" question. Also, my machine is not really called mybox
but is not really pertinent for the question in either case.
Using Ubuntu 17.04, all stock, with munge
to communicate (which is the default anyway).
Edit: To wit:
me@mybox:~$ COLUMNS=90 dpkg -l '*slurm*' | grep ^ii
ii slurm-client 16.05.9-1ubun amd64 SLURM client side commands
ii slurm-wlm-basic- 16.05.9-1ubun amd64 SLURM basic plugins
ii slurmctld 16.05.9-1ubun amd64 SLURM central management daemon
ii slurmd 16.05.9-1ubun amd64 SLURM compute node daemon
me@mybox:~$
I would still prefer to run SLURM natively, but I caved and spun up a Debian 9.2 VM. See here for my efforts to troubleshoot a native installation. The directions here worked smoothly, but I needed to make the following changes to slurm.conf
. Below, Debian64
is the hostname
, and wlandau
is my user name.
ControlMachine=Debian64
SlurmUser=wlandau
NodeName=Debian64
Here is the complete slurm.conf
. An analogous slurm.conf
did not work on my native Ubuntu 16.04.
# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=Debian64
#ControlAddr=
#BackupController=
#BackupAddr=
#
AuthType=auth/munge
#CheckpointType=checkpoint/none
CryptoType=crypto/munge
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=999999
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobCheckpointDir=/var/lib/slurm-llnl/checkpoint
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=1
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/usr/bin/mail
#MaxJobCount=5000
#MaxStepCount=40000
#MaxTasksPerNode=128
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/pgid
#Prolog=
#PrologFlags=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
#SallocDefaultCommand=
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
SlurmUser=wlandau
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/lib/slurm-llnl/slurmctld
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/none
#TaskPluginParam=
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
FastSchedule=1
#MaxMemPerCPU=0
#SchedulerRootFilter=1
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/linear
#SelectTypeParameters=
#
#
# JOB PRIORITY
#PriorityFlags=
#PriorityType=priority/basic
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
#AccountingStorageUser=
AccountingStoreJobComment=YES
ClusterName=cluster
#DebugFlags=
#JobCompHost=
#JobCompLoc=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
#JobContainerType=job_container/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=Debian64 CPUs=1 RealMemory=744 CoresPerSocket=1 ThreadsPerCore=1 State=UNKNOWN
PartitionName=debug Nodes=Debian64 Default=YES MaxTime=INFINITE State=UP