How to get current available GPUs in tensorflow?

2019-01-03 08:45发布

I have a plan to use distributed TensorFlow, and I saw TensorFlow can use GPUs for training and testing. In a cluster environment, each machine could have 0 or 1 or more GPUs, and I want to run my TensorFlow graph into GPUs on as many machines as possible.

I found that when running tf.Session() TensorFlow gives information about GPU in the log messages like below:

I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:838] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0)

My question is how do I get information about current available GPU from TensorFlow? I can get loaded GPU information from the log, but I want to do it in a more sophisticated, programmatic way. I also could restrict GPUs intentionally using the CUDA_VISIBLE_DEVICES environment variable, so I don't want to know a way of getting GPU information from OS kernel.

In short, I want a function like tf.get_available_gpus() that will return ['/gpu:0', '/gpu:1'] if there are two GPUs available in the machine. How can I implement this?

5条回答
对你真心纯属浪费
2楼-- · 2019-01-03 09:10

There is also a method in the test util. So all that has to be done is:

tf.test.is_gpu_available()

and/or

tf.test.gpu_device_name()

Look up the Tensorflow docs for arguments.

查看更多
仙女界的扛把子
3楼-- · 2019-01-03 09:14

You can check all device list using following code:

from tensorflow.python.client import device_lib

device_lib.list_local_devices()
查看更多
beautiful°
4楼-- · 2019-01-03 09:34

There is an undocumented method called device_lib.list_local_devices() that enables you to list the devices available in the local process. (N.B. As an undocumented method, this is subject to backwards incompatible changes.) The function returns a list of DeviceAttributes protocol buffer objects. You can extract a list of string device names for the GPU devices as follows:

from tensorflow.python.client import device_lib

def get_available_gpus():
    local_device_protos = device_lib.list_local_devices()
    return [x.name for x in local_device_protos if x.device_type == 'GPU']

Note that (at least up to TensorFlow 1.4), calling device_lib.list_local_devices() will run some initialization code that, by default, will allocate all of the GPU memory on all of the devices (GitHub issue). To avoid this, first create a session with an explicitly small per_process_gpu_fraction, or allow_growth=True, to prevent all of the memory being allocated. See this question for more details.

查看更多
再贱就再见
5楼-- · 2019-01-03 09:35

The accepted answer gives you the number of GPUs but it also allocates all the memory on those GPUs. You can avoid this by creating a session with fixed lower memory before calling device_lib.list_local_devices() which may be unwanted for some applications.

I ended up using nvidia-smi to get the number of GPUs without allocating any memory on them.

import subprocess

n = str(subprocess.check_output(["nvidia-smi", "-L"])).count('UUID')
查看更多
The star\"
6楼-- · 2019-01-03 09:36

Apart from the excellent explanation by Mrry, where he suggested to use device_lib.list_local_devices() I can show you how you can check for GPU related information from the command line.

Because currently only Nvidia's gpus work for NN frameworks, the answer covers only them. Nvidia has a page where they document how you can use the /proc filesystem interface to obtain run-time information about the driver, any installed NVIDIA graphics cards, and the AGP status.

/proc/driver/nvidia/gpus/0..N/information

Provide information about each of the installed NVIDIA graphics adapters (model name, IRQ, BIOS version, Bus Type). Note that the BIOS version is only available while X is running.

So you can run this from command line cat /proc/driver/nvidia/gpus/0/information and see information about your first GPU. It is easy to run this from python and also you can check second, third, fourth GPU till it will fail.

Definitely Mrry's answer is more robust and I am not sure whether my answer will work on non-linux machine, but that Nvidia's page provide other interesting information, which not many people know about.

查看更多
登录 后发表回答