Running TensorFlow on a Slurm Cluster?

I have access to a computing cluster, specifically one node with two 12-core CPUs, which is managed by the Slurm Workload Manager.

I would like to run TensorFlow on that system, but unfortunately I was not able to find any information about how to do this, or whether it is even possible. I am new to this, but as far as I understand it, I would have to run TensorFlow by creating a Slurm job and cannot directly execute python/tensorflow via ssh.

Does anyone have an idea, a tutorial, or any other source on this topic?

Answer 1:

It's relatively simple.

Under the simplifying assumption that you request one process per host, Slurm will provide you with all the information you need in environment variables, specifically SLURM_PROCID, SLURM_NPROCS and SLURM_NODELIST.

For example, you can initialize your task index, the number of tasks and the nodelist as follows:

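import os
from hostlist import expand_hostlist

# Slurm sets these variables for every task it launches.
task_index  = int( os.environ['SLURM_PROCID'] )   # rank of this task
n_tasks     = int( os.environ['SLURM_NPROCS'] )   # total number of tasks
# One "host:port" entry per node in the allocation.
tf_hostlist = [ ("%s:22222" % host) for host in
                expand_hostlist( os.environ['SLURM_NODELIST']) ]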

Note that Slurm gives you the host list in a compressed format (e.g., "myhost[11-99]"), which you need to expand. I do that with the hostlist module by Kent Engström, available at https://pypi.python.org/pypi/python-hostlist
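To make the expansion step concrete, here is a minimal sketch of what it does. The helper name expand_simple_hostlist is hypothetical and is not part of python-hostlist. It only handles names with a single bracket group, so prefer the real module (or scontrol show hostnames) for anything beyond simple cases.

```python
import re

def expand_simple_hostlist(hostlist):
    """Expand e.g. 'node[01-03,07],login1' into individual host names.

    Illustrative helper only: supports at most one [..] group per name.
    """
    hosts = []
    # Split the list on commas that are not inside square brackets.
    for entry in re.findall(r"[^,\[]+(?:\[[^\]]*\])?[^,\[]*", hostlist):
        match = re.fullmatch(r"([^\[]*)\[([^\]]*)\]([^\[]*)", entry)
        if match is None:
            hosts.append(entry)  # plain host name, nothing to expand
            continue
        prefix, ranges, suffix = match.groups()
        for part in ranges.split(","):
            low, _, high = part.partition("-")
            high = high or low   # single index such as "07"
            width = len(low)     # preserve zero padding
            for number in range(int(low), int(high) + 1):
                hosts.append("%s%0*d%s" % (prefix, width, number, suffix))
    return hosts

print(expand_simple_hostlist("myhost[11-13]"))
```

Each expanded name can then be combined with a port to form the "host:port" entries that TensorFlow expects.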

At that point, you can go right ahead and create your TensorFlow cluster specification and server with the information you have available, e.g.:

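import tensorflow as tf  # tf.train.Server is the TensorFlow 1.x API

# A single job named "your_taskname" with one task per host.
cluster = tf.train.ClusterSpec( {"your_taskname" : tf_hostlist } )
server  = tf.train.Server( cluster.as_cluster_def(),
                           job_name   = "your_taskname",
                           task_index = task_index )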

And you're set! You can now place TensorFlow operations on a specific host of your allocation with the usual syntax:

for idx in range(n_tasks):
   with tf.device("/job:your_taskname/task:%d" % idx ):
       ...

A flaw in the code above is that all your jobs will instruct TensorFlow to start servers listening on the fixed port 22222. If multiple such jobs happen to be scheduled on the same node, the second one will fail to bind to port 22222.
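The failure mode is easy to reproduce without TensorFlow or Slurm. The sketch below uses plain sockets on the loopback interface, with a hypothetical function name, purely to show that a second listener on a port that is already in use is refused by the operating system.

```python
import socket

def second_bind_fails():
    """Return True if binding a second socket to an in-use port is refused."""
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        first.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
        first.listen(1)
        port = first.getsockname()[1]
        try:
            second.bind(("127.0.0.1", port))  # same port as the first listener
        except OSError:
            return True                # typically "Address already in use"
        return False
    finally:
        first.close()
        second.close()

print(second_bind_fails())
```

Two TensorFlow servers from different jobs on the same node would run into the same kind of error if both insisted on port 22222.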

A better solution is to let Slurm reserve ports for each job. You need to bring your Slurm administrator on board and ask them to configure Slurm so that it allows you to request ports with the --resv-ports option. In practice, this means asking them to add a line like the following to slurm.conf:

MpiParams=ports=15000-19999

Before you bug your slurm admin, check what options are already configured, e.g., with:

scontrol show config | grep MpiParams

If your site already uses an old version of OpenMPI, there's a chance an option like this is already in place.

Then, amend my first snippet of code as follows:

import os
from hostlist import expand_hostlist

task_index  = int( os.environ['SLURM_PROCID'] )
n_tasks     = int( os.environ['SLURM_NPROCS'] )
# Use the first port of the range that Slurm reserved for this job step.
port        = int( os.environ['SLURM_STEP_RESV_PORTS'].split('-')[0] )
tf_hostlist = [ ("%s:%s" % (host,port)) for host in
                expand_hostlist( os.environ['SLURM_NODELIST']) ]
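If you need more than one port per task, you can parse the whole reservation instead of taking only its first port. This is a minimal sketch with a hypothetical helper name. It assumes the value looks like "15000-15002" or a single port, and accepting comma-separated pieces is my own assumption rather than documented behaviour, so check what your site actually exports.

```python
def parse_resv_ports(value):
    """Turn '15000-15002' (or a single port such as '15000') into a list of ints.

    Comma-separated pieces are also accepted, which is an assumption.
    """
    ports = []
    for part in value.split(","):
        low, _, high = part.strip().partition("-")
        high = high or low             # a single port has no upper bound
        ports.extend(range(int(low), int(high) + 1))
    return ports

print(parse_resv_ports("15000-15002"))
```

Inside a job step you would call it as parse_resv_ports(os.environ['SLURM_STEP_RESV_PORTS']) and use the first entry as the server port.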

Good luck!

Answer 2:

You can simply pass a batch script to Slurm with the sbatch command, like so:

sbatch --partition=part start.sh

You can list the available partitions with sinfo.

start.sh (possible configuration):

#!/bin/sh
#SBATCH -N 1         # nodes requested
#SBATCH -n 1         # tasks requested
#SBATCH -c 10        # CPU cores requested per task
#SBATCH --mem=32000  # memory in MB
#SBATCH -o outfile   # send stdout to outfile
#SBATCH -e errfile   # send stderr to errfile
python run.py

Here run.py contains the script you want Slurm to execute, i.e. your TensorFlow code.

You can look up the details here: https://slurm.schedmd.com/sbatch.html
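As a rough sketch of how run.py could pick up the -c 10 request from start.sh: Slurm exports SLURM_CPUS_PER_TASK when -c is given, so the script can read it instead of hard-coding a thread count. The function name requested_cpus and the fallback to all local cores are my own choices for illustration.

```python
import os

def requested_cpus(env=os.environ):
    """CPU cores Slurm granted to this task, or all local cores outside Slurm."""
    value = env.get("SLURM_CPUS_PER_TASK")  # only set when -c was requested
    if value:
        return int(value)
    return os.cpu_count() or 1              # fallback for runs outside Slurm

if __name__ == "__main__":
    print("CPU cores available to this task:", requested_cpus())
```

You could then pass this number to TensorFlow's thread settings so that the job does not use more cores than it was allocated.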