I have been working on getting an application that relies on TensorFlow to work as a docker container with nvidia-docker
. I have compiled my application on top of the tensorflow/tensorflow:latest-gpu-py3
image. I run my docker container with the following command:
sudo nvidia-docker run -d -p 9090:9090 -v /src/weights:/weights myname/myrepo:mylabel
When looking at the logs through portainer
I see the following:
2017-05-16 03:41:47.715682: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-05-16 03:41:47.715896: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-05-16 03:41:47.715948: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-05-16 03:41:47.715978: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-05-16 03:41:47.716002: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-05-16 03:41:47.718076: E tensorflow/stream_executor/cuda/cuda_driver.cc:405] failed call to cuInit: CUDA_ERROR_UNKNOWN
2017-05-16 03:41:47.718177: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:158] retrieving CUDA diagnostic information for host: 1e22bdaf82f1
2017-05-16 03:41:47.718216: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:165] hostname: 1e22bdaf82f1
2017-05-16 03:41:47.718298: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda reported version is: 367.57.0
2017-05-16 03:41:47.718398: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:369] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module 367.57 Mon Oct 3 20:37:01 PDT 2016
GCC version: gcc version 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04.3)
"""
2017-05-16 03:41:47.718455: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:193] kernel reported version is: 367.57.0
2017-05-16 03:41:47.718484: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:300] kernel version seems to match DSO: 367.57.0
The container does seem to start properly, and my application does appear to be running. When I send requests to it for predictions the predictions are returned correctly - however at the slow speed I would expect when running inference on the CPU, so I think it's pretty clear that the GPU is not being used for some reason. I've also tried running nvidia-smi
from within that same container to make sure it is seeing my GPU and these are the results for that:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.57 Driver Version: 367.57 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GRID K1 Off | 0000:00:07.0 Off | N/A |
| N/A 28C P8 7W / 31W | 25MiB / 4036MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
I'm certainly no expert in this - but it does appear that the GPU is visible from inside the container. Any ideas on how to get this working with TensorFlow?
Maybe the problem is related to JIT caching files permissions, created by GPU. On linux, by default, cache files were created at ~/.nv/ComputeCache. Setting another directory for JIT cache solves the problem. Just do
before running something on GPU.
I run tensorflow on my ubuntu16.04 desktop.
I run code with GPU works well days before. But today I cannot find gpu device with below code
import tensorflow as tf from tensorflow.python.client import device_lib as _device_lib with tf.Session() as sess: local_device_protos = _device_lib.list_local_devices() print(local_device_protos) [print(x.name) for x in local_device_protos]
And I realize the below issue , when I run
tf.Session()
I check my Nvidia driver in the system details, and
nvcc -V
,nvida-smi
to check driver ,cuda and cudnn. Everything seems well.Then I went to Additional Drivers to check driver detail, there I find there are many versions of the NVIDIA driver and the latest version selected. But when I first install the driver there is only one.
So I select a old version, and apply the change.
Then I run the
tf.Session()
the issue is also here. I think I should reboot my computer, after I rebooted it, this issue gone.sess = tf.Session() 2018-07-01 12:02:41.336648: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA 2018-07-01 12:02:41.464166: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2018-07-01 12:02:41.464482: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties: name: GeForce GTX 1070 major: 6 minor: 1 memoryClockRate(GHz): 1.8225 pciBusID: 0000:01:00.0 totalMemory: 7.93GiB freeMemory: 7.27GiB 2018-07-01 12:02:41.464494: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0 2018-07-01 12:02:42.308689: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix: 2018-07-01 12:02:42.308721: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929] 0 2018-07-01 12:02:42.308729: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0: N 2018-07-01 12:02:42.309686: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7022 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1070, pci bus id: 0000:01:00.0, compute capability: