I was able to get access to a computing cluster, specifically one node with two 12-core CPUs, which is running the Slurm Workload Manager.
I would like to run TensorFlow on that system, but unfortunately I was not able to find any information about how to do this or whether it is even possible. I am new to this, but as far as I understand it, I would have to run TensorFlow by creating a Slurm job and cannot directly execute python/tensorflow via ssh.
Does anyone have an idea, a tutorial, or any other kind of source on this topic?
It's relatively simple.
Under the simplifying assumption that you request one process per host, Slurm will provide you with all the information you need in environment variables, specifically SLURM_PROCID, SLURM_NPROCS and SLURM_NODELIST.
For example, you can initialize your task index, the number of tasks and the nodelist as follows:
Note that slurm gives you the host list in its compressed format (e.g., "myhost[11-99]"), which you need to expand. I do that with the hostlist module by Kent Engström, available here: https://pypi.python.org/pypi/python-hostlist
At that point, you can go right ahead and create your TensorFlow cluster specification and server with the information you have available, e.g.:
And you're set! You can now perform TensorFlow node placement on a specific host of your allocation with the usual syntax:
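For instance (a minimal sketch using an explicit graph so it also works standalone; the job name "worker" is assumed to match the one used in your ClusterSpec):

```python
import tensorflow as tf

graph = tf.Graph()
with graph.as_default():
    # Pin this op to task 1 of the "worker" job defined in the ClusterSpec
    with tf.device("/job:worker/task:1"):
        x = tf.constant(42.0, name="x")

print(x.device)  # shows the requested placement
```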
A flaw in the code above is that all your jobs will instruct TensorFlow to start servers listening on the fixed port 22222. If multiple such jobs happen to be scheduled to the same node, the second one will fail to listen on port 22222.
A better solution is to let slurm reserve ports for each job. You need to bring your slurm administrator on board and ask them to configure slurm so that it allows you to request ports with the --resv-ports option. In practice, this requires asking them to add a line like the following to their slurm.conf:
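For example (the port range is site-specific and chosen here only for illustration; this reuses the MpiParams mechanism that OpenMPI support relies on):

```
MpiParams=ports=12000-12999
```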
Before you bug your slurm admin, check what options are already configured, e.g., with:
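One way to do that (the output is site-specific, so this only shows whether a port range is configured at all):

```shell
scontrol show config | grep MpiParams
```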
If your site already uses an old version of OpenMPI, there's a chance an option like this is already in place.
Then, amend my first snippet of code as follows:
Good luck!
You can simply pass a batch script to Slurm with the sbatch command. Listing the available partitions can be done with sinfo.
start.sh (possible configuration):
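One possible sketch (job name, resource numbers and output file are placeholders to adapt to your site; the CPU count matches the two 12-core CPUs from the question):

```shell
#!/bin/bash
#SBATCH --job-name=tf_job          # placeholder job name
#SBATCH --ntasks=1                 # one task
#SBATCH --cpus-per-task=24         # both 12-core CPUs of the node
#SBATCH --output=tf_job.%j.out     # stdout/stderr, %j = job id

python run.py
```

Submit it with `sbatch start.sh` (optionally selecting a partition with `sbatch --partition=... start.sh`).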
where run.py contains the script you want Slurm to execute, i.e. your TensorFlow code.
You can look up the details here: https://slurm.schedmd.com/sbatch.html