When we want to use distributed TensorFlow, we create a parameter server using tf.train.Server.join().
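For context, a typical parameter server process looks roughly like this (the cluster spec and ports below are just placeholders):

```python
import tensorflow as tf

# Placeholder cluster spec: one parameter server and one worker on localhost.
cluster = tf.train.ClusterSpec({
    "ps": ["localhost:2222"],
    "worker": ["localhost:2223"],
})

# Start the parameter server for this process.
server = tf.train.Server(cluster, job_name="ps", task_index=0)

# Blocks forever; there is no built-in way to make it return.
server.join()
```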
However, I can't find any way to shut down the server other than killing the process. The TensorFlow documentation for join() says:
Blocks until the server has shut down.
This method currently blocks forever.
This is quite bothersome because I would like to create many servers for computation and shut them down when everything finishes.
Is there a possible solution for this?
Thanks.
This page appears pretty often on Google, so I thought I would try to improve on Yaroslav's answer by providing what I hope is a clearer answer for those just getting into distributed TensorFlow.
It's pretty simple to extend the "canonical" Distributed TensorFlow example by replacing the worker section of the code with a snippet along these lines:
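A minimal sketch of that pattern, assuming cluster, FLAGS.job_name and FLAGS.task_index come from the canonical example (the done_queue name is purely illustrative):

```python
import tensorflow as tf

# Sketch only: `cluster`, FLAGS.job_name and FLAGS.task_index are assumed
# to come from the canonical distributed TensorFlow example.
num_workers = cluster.num_tasks("worker")
num_ps = cluster.num_tasks("ps")

def done_queue(ps_task):
    # One shared queue per ps shard, pinned to that shard's device.
    with tf.device("/job:ps/task:%d" % ps_task):
        return tf.FIFOQueue(num_workers, tf.int32,
                            shared_name="done_queue%d" % ps_task)

server = tf.train.Server(cluster, job_name=FLAGS.job_name,
                         task_index=FLAGS.task_index)

if FLAGS.job_name == "ps":
    queue = done_queue(FLAGS.task_index)
    with tf.Session(server.target) as sess:
        # Instead of server.join(): wait for one token per worker, then fall
        # through so the Python process (and the server) exits.
        for _ in range(num_workers):
            sess.run(queue.dequeue())
    print("ps %d: all workers done, shutting down" % FLAGS.task_index)
else:  # worker
    # ... build the model and run the training loop as in the example ...
    enqueue_ops = [done_queue(i).enqueue(1) for i in range(num_ps)]
    with tf.Session(server.target) as sess:
        # ... training happens here ...
        sess.run(enqueue_ops)  # signal every ps shard that we are finished
```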
Note that the MonitoredTrainingSession version seems to be much slower at connecting all of the workers together.
You can have parameter server processes die on demand by using session.run(dequeue_op) instead of server.join() and having another process enqueue something onto that queue when you want this process to die.

So for k parameter server shards you could create k queues, each with a unique shared_name property, and try to dequeue from that queue. When you want to bring down the servers, you loop over all the queues and enqueue a token onto each one. This causes session.run to unblock; the Python process then runs to the end and quits, bringing down the server.

Below is a self-contained example with 2 shards taken from: https://gist.github.com/yaroslavvb/82a5b5302449530ca5ff59df520c369e
(for multi worker/multi shard example, see https://gist.github.com/yaroslavvb/ea1b1bae0a75c4aae593df7eca72d9ca)
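A stripped-down sketch of the same idea with 2 ps shards might look like this (the ports, queue names and helper functions here are illustrative, not copied from the gists):

```python
import tensorflow as tf

# Illustrative only: two ps shards that can be shut down on demand.
cluster = tf.train.ClusterSpec({"ps": ["localhost:2230", "localhost:2231"],
                                "worker": ["localhost:2232"]})

def shutdown_queue(ps_task):
    # Shared queue living on the given ps shard; the name is arbitrary but
    # must match between the ps process and whoever triggers the shutdown.
    with tf.device("/job:ps/task:%d" % ps_task):
        return tf.FIFOQueue(1, tf.int32,
                            shared_name="shutdown_queue%d" % ps_task)

def run_ps(task_index):
    # Runs in each ps process in place of server.join().
    server = tf.train.Server(cluster, job_name="ps", task_index=task_index)
    with tf.Session(server.target) as sess:
        sess.run(shutdown_queue(task_index).dequeue())  # blocks until a token arrives
    # Falling off the end of the script exits the process and the server dies.

def shutdown_all_ps(target, num_ps=2):
    # Called from a worker/driver process once all computation is finished.
    with tf.Session(target) as sess:
        for i in range(num_ps):
            sess.run(shutdown_queue(i).enqueue(1))
```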
There's currently no clean way to shut down a TensorFlow gRPC server. It is possible to shut down a gRPC server, but doing it safely requires additional memory management for all of the in-flight request and response buffers, which would require a lot of additional plumbing (of the worst kind: asynchronous shared memory management...) for a feature that nobody had requested—until now!
In practice you should be able to use the same tf.train.Server object for many different computations. If this doesn't work for your use case, please feel free to open a GitHub issue and tell us more about it.
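For example, reusing one in-process server for two separate computations looks roughly like this (the computations are arbitrary):

```python
import tensorflow as tf

# One server, reused for two independent computations.
server = tf.train.Server.create_local_server()

# First computation.
with tf.Graph().as_default():
    c = tf.constant(21) * 2
    with tf.Session(server.target) as sess:
        print(sess.run(c))   # 42

# Second, unrelated computation against the same server.
with tf.Graph().as_default():
    v = tf.Variable(3.0)
    with tf.Session(server.target) as sess:
        sess.run(tf.global_variables_initializer())
        print(sess.run(v * v))   # 9.0
```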