I have a plan to use distributed TensorFlow, and I saw that TensorFlow can use GPUs for training and testing. In a cluster environment, each machine could have 0, 1, or more GPUs, and I want to run my TensorFlow graph on GPUs on as many machines as possible.
I found that when running tf.Session(), TensorFlow prints information about the GPU in log messages like the ones below:
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:838] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0)
My question is: how do I get information about the currently available GPUs from TensorFlow? I can get the loaded GPU information from the log, but I want to do it in a more sophisticated, programmatic way. I could also restrict the visible GPUs intentionally using the CUDA_VISIBLE_DEVICES environment variable, so I don't want a way of getting GPU information from the OS kernel.
In short, I want a function like tf.get_available_gpus()
that will return ['/gpu:0', '/gpu:1']
if there are two GPUs available on the machine. How can I implement this?
There is also a method in the test util, so all that has to be done is call it (see the sketch below), and look up the TensorFlow docs for the arguments.
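A minimal sketch, assuming the test-util methods meant here are tf.test.is_gpu_available() and tf.test.gpu_device_name() (TensorFlow 1.x API):

import tensorflow as tf

# True if at least one GPU is available to this process
# (this may initialize the GPUs as a side effect).
print(tf.test.is_gpu_available())

# Name of the default GPU device, e.g. '/device:GPU:0', or '' if there is none.
print(tf.test.gpu_device_name())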
You can check the full device list using the following code:
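Presumably something along these lines (a sketch assuming TensorFlow 1.x, where device_lib lives under tensorflow.python.client):

from tensorflow.python.client import device_lib

# Prints one DeviceAttributes entry per local device (the CPU and any GPUs).
print(device_lib.list_local_devices())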
There is an undocumented method called
device_lib.list_local_devices()
that enables you to list the devices available in the local process. (N.B. As an undocumented method, this is subject to backwards-incompatible changes.) The function returns a list of DeviceAttributes protocol buffer objects. You can extract a list of string device names for the GPU devices as follows:
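A sketch of such a helper (assuming the TensorFlow 1.x layout, where device_lib is importable from tensorflow.python.client):

from tensorflow.python.client import device_lib

def get_available_gpus():
    # Each entry is a DeviceAttributes protocol buffer; keep only the GPU devices.
    local_device_protos = device_lib.list_local_devices()
    return [x.name for x in local_device_protos if x.device_type == 'GPU']

print(get_available_gpus())  # e.g. ['/gpu:0', '/gpu:1'] (exact names vary by TF version)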
Note that (at least up to TensorFlow 1.4), calling device_lib.list_local_devices() will run some initialization code that, by default, will allocate all of the GPU memory on all of the devices (GitHub issue). To avoid this, first create a session with an explicitly small per_process_gpu_memory_fraction, or with allow_growth=True, to prevent all of the memory being allocated. See this question for more details.
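For example, a minimal sketch of such a session (assuming the TensorFlow 1.x Session/ConfigProto API):

import tensorflow as tf
from tensorflow.python.client import device_lib

# Create a session that does not grab all GPU memory up front.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
# or cap it explicitly:
# config.gpu_options.per_process_gpu_memory_fraction = 0.01
sess = tf.Session(config=config)

print(device_lib.list_local_devices())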
The accepted answer gives you the number of GPUs, but it also allocates all the memory on those GPUs, which may be unwanted for some applications. You can avoid this by creating a session with a fixed, lower memory limit before calling device_lib.list_local_devices().
I ended up using nvidia-smi to get the number of GPUs without allocating any memory on them.
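A minimal sketch of that approach (it assumes the nvidia-smi binary is on the PATH and uses its standard --query-gpu/--format options):

import subprocess

def get_num_gpus():
    # nvidia-smi prints one line per GPU with these options; count the lines.
    output = subprocess.check_output(
        ['nvidia-smi', '--query-gpu=name', '--format=csv,noheader'])
    return len(output.decode('utf-8').strip().splitlines())

print(get_num_gpus())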
Apart from the excellent explanation by Mrry, where he suggested using
device_lib.list_local_devices()
I can show you how you can check for GPU-related information from the command line. Because currently only Nvidia's GPUs work for NN frameworks, this answer covers only them. Nvidia has a page where they document how you can use the /proc filesystem interface to obtain run-time information about the driver, any installed NVIDIA graphics cards, and the AGP status.
So you can run this from the command line:
cat /proc/driver/nvidia/gpus/0/information
and see information about your first GPU. It is easy to run this from Python, and you can also check the second, third, fourth GPU until it fails. Mrry's answer is definitely more robust, and I am not sure whether my answer will work on a non-Linux machine, but Nvidia's page provides other interesting information which not many people know about.
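A sketch of doing that from Python (Linux only; it assumes the proprietary NVIDIA driver exposes /proc/driver/nvidia/gpus, and simply counts the per-GPU subdirectories rather than probing index 0, 1, 2, ... until one fails):

import os

GPU_PROC_DIR = '/proc/driver/nvidia/gpus'

def get_nvidia_gpu_info():
    # One subdirectory per GPU; each contains an 'information' file.
    if not os.path.isdir(GPU_PROC_DIR):
        return []
    info = []
    for entry in sorted(os.listdir(GPU_PROC_DIR)):
        with open(os.path.join(GPU_PROC_DIR, entry, 'information')) as f:
            info.append(f.read())
    return info

gpus = get_nvidia_gpu_info()
print('%d GPU(s) found' % len(gpus))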