could not create cudnn handle: CUDNN_STATUS_INTERN

Posted 2020-02-10 13:07

Question:

I installed the TensorFlow 1.0.1 GPU version on my MacBook Pro with a GeForce GT 750M. I also installed CUDA 8.0.71 and cuDNN 5.1. I am running TF code that works fine with the non-GPU (CPU-only) TensorFlow, but with the GPU version I get this error (once in a while it works, too):

name: GeForce GT 750M
major: 3 minor: 0 memoryClockRate (GHz) 0.9255
pciBusID 0000:01:00.0
Total memory: 2.00GiB
Free memory: 67.48MiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GT 750M, pci bus id: 0000:01:00.0)
E tensorflow/stream_executor/cuda/cuda_driver.cc:1002] failed to allocate 67.48M (70754304 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
Training...

E tensorflow/stream_executor/cuda/cuda_dnn.cc:397] could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
E tensorflow/stream_executor/cuda/cuda_dnn.cc:364] could not destroy cudnn handle: CUDNN_STATUS_BAD_PARAM
F tensorflow/core/kernels/conv_ops.cc:605] Check failed: stream->parent()->GetConvolveAlgorithms(&algorithms) 
Abort trap: 6

What is happening here? Is this a bug in TensorFlow? Please help.

Here is the GPU memory usage while I run the Python code:

Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 83.477 of 2047.6 MB (i.e. 4.08%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 83.477 of 2047.6 MB (i.e. 4.08%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 83.477 of 2047.6 MB (i.e. 4.08%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 1.1016 of 2047.6 MB (i.e. 0.0538%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 1.1016 of 2047.6 MB (i.e. 0.0538%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 1.1016 of 2047.6 MB (i.e. 0.0538%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 1.1016 of 2047.6 MB (i.e. 0.0538%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 91.477 of 2047.6 MB (i.e. 4.47%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 22.852 of 2047.6 MB (i.e. 1.12%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 22.852 of 2047.6 MB (i.e. 1.12%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 36.121 of 2047.6 MB (i.e. 1.76%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 71.477 of 2047.6 MB (i.e. 3.49%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 67.477 of 2047.6 MB (i.e. 3.3%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 67.477 of 2047.6 MB (i.e. 3.3%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 67.477 of 2047.6 MB (i.e. 3.3%) Free
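
The cuda-smi samples above show free memory staying below 5% of the 2 GiB card. As a rough sketch, lines in that format can be parsed to make the trend explicit. The function name parse_cuda_smi_line is made up for this illustration, and the regular expression assumes exactly the line layout shown in the question:

```python
import re
from typing import Optional, Tuple

# Matches e.g. "...(CC 3.0): 83.477 of 2047.6 MB (i.e. 4.08%) Free"
_FREE_LINE = re.compile(r":\s*([\d.]+) of ([\d.]+) MB \(i\.e\. [\d.]+%\) Free")


def parse_cuda_smi_line(line: str) -> Optional[Tuple[float, float]]:
    """Return (free_mb, total_mb) for a cuda-smi device line, else None."""
    match = _FREE_LINE.search(line)
    if match is None:
        return None  # shell prompts and other lines are ignored
    return float(match.group(1)), float(match.group(2))


def lowest_free_mb(lines):
    """Smallest free-memory reading among the parsed device lines."""
    readings = [parsed[0] for parsed in map(parse_cuda_smi_line, lines) if parsed]
    return min(readings) if readings else None
```

Feeding the captured terminal output line by line into lowest_free_mb gives the tightest moment, which is the number that matters when cuDNN tries to allocate its handle.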

Answer 1:

I have managed to get it working by deleting the .nv folder in my home folder:

sudo rm -rf ~/.nv/


Answer 2:

In my case, after checking the cuDNN and CUDA versions, I found that my GPU was out of memory. Running watch -n 0.1 nvidia-smi in another bash terminal, I saw that the error 2019-07-16 19:54:05.122224: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR appears at the moment the GPU memory is nearly full.

So I configured a limit on how much of my GPU TensorFlow may use. As I use the tf.keras module, I added the following code to the beginning of my program:

import tensorflow as tf  # TF 1.x API

config = tf.ConfigProto()
# Let this process claim at most 90% of the GPU memory.
config.gpu_options.per_process_gpu_memory_fraction = 0.9
tf.keras.backend.set_session(tf.Session(config=config))

Then, problem solved!

You can also reduce your batch_size or use smarter ways to feed your training data (such as tf.data.Dataset with caching). I hope my answer can help someone else.
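
If you would rather derive the fraction from what a monitoring tool reports than hard-code 0.9, something like the sketch below could work. It assumes per_process_gpu_memory_fraction is measured against total device memory; the helper name and the 0.9 headroom factor are my own choices, not anything from TensorFlow:

```python
def suggested_memory_fraction(free_mib: float, total_mib: float,
                              headroom: float = 0.9) -> float:
    """Fraction of total GPU memory that fits inside the currently free part.

    headroom < 1 leaves a margin for other processes; the result is clamped
    to the [0, 1] range that the TensorFlow option accepts.
    """
    if total_mib <= 0:
        raise ValueError("total_mib must be positive")
    fraction = headroom * free_mib / total_mib
    return max(0.0, min(1.0, fraction))
```

The result would then be assigned to config.gpu_options.per_process_gpu_memory_fraction before the session is created.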



Answer 3:

In TensorFlow 2.0, my issue was resolved by enabling memory growth. ConfigProto is deprecated in TF 2.0, so I used tf.config.experimental instead. My computer specs are:

  • OS: Ubuntu 18.04
  • GPU: GeForce RTX 2070
  • Nvidia Driver: 430.26
  • Tensorflow: 2.0
  • Cudnn: 7.6.2
  • Cuda: 10.0

The code I used was:

import tensorflow as tf

physical_devices = tf.config.experimental.list_physical_devices('GPU')
assert len(physical_devices) > 0, "Not enough GPU hardware devices available"
# set_memory_growth returns None, so there is nothing useful to assign.
tf.config.experimental.set_memory_growth(physical_devices[0], True)
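
On a machine with more than one GPU, the same setting presumably has to be applied to every device. A minimal sketch, assuming TensorFlow 2.x; the wrapper function is hypothetical and returns 0 when TensorFlow is not installed so that the snippet can be tried anywhere:

```python
def enable_memory_growth_on_all_gpus() -> int:
    """Turn on memory growth for every visible GPU; return how many were set."""
    try:
        import tensorflow as tf
    except ImportError:
        return 0  # TensorFlow is not installed in this environment

    gpus = tf.config.experimental.list_physical_devices("GPU")
    for gpu in gpus:
        # Raises RuntimeError if the GPUs were already initialized, so call
        # this before building any model or running any op.
        tf.config.experimental.set_memory_growth(gpu, True)
    return len(gpus)
```

Call it once at the very top of the program, before any tensors are created.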


Answer 4:

As strange as this may sound, try restarting your computer and rerunning your model. If the model then runs fine, the issue is with your GPU memory allocation and TensorFlow's management of the available memory. On Windows 10 I had two terminals open, and closing one solved my problem. There may be open (zombie) threads that are still holding memory.
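
Before rebooting, it may be worth checking which processes are holding GPU memory. The sketch below parses the CSV that nvidia-smi prints for its --query-compute-apps option; the function names are mine, and it assumes nvidia-smi is on the PATH (it is not available on macOS, where the question used cuda-smi):

```python
import csv
import io
import subprocess
from typing import List, Optional, Tuple

QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(csv_text: str) -> List[Tuple[int, str, Optional[int]]]:
    """Turn 'pid, name, used_MiB' lines into tuples; unknown usage becomes None."""
    rows = []
    for fields in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        pid, name, used = (field.strip() for field in fields)
        rows.append((int(pid), name, int(used) if used.isdigit() else None))
    return rows


def list_gpu_processes() -> List[Tuple[int, str, Optional[int]]]:
    """Run nvidia-smi and return the processes currently using the GPU."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_compute_apps(result.stdout)
```

Any leftover Python or Jupyter process that shows up in list_gpu_processes() is a candidate for closing before the next training run.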



Answer 5:

In my case it seems that the problem was caused by tensorflow and cudnn version mismatch. The following helped me (I was working on Ubuntu 16.04 with NVidia Tesla K80 on Google Cloud, tensorflow 1.5 finally worked with cudnn 7.0.4 and cuda 9.0):

  1. Remove cuDNN completely:

    sudo rm /usr/local/cuda/include/cudnn.h
    sudo rm /usr/local/cuda/lib64/libcudnn*
    

    After doing so, import tensorflow should raise an error.

  2. Download the appropriate cuDNN version. Note that there is a cuDNN 7.0.4 for CUDA 9.0 and a cuDNN 7.0.4 for CUDA 8.0. You should choose the one corresponding to your CUDA version. Be careful at this step or you'll get a similar problem again. Install cuDNN as usual:

    tar -xzvf cudnn-9.0-linux-x64-v7.tgz
    cd cuda
    sudo cp -P include/cudnn.h /usr/include
    sudo cp -P lib64/libcudnn* /usr/lib/x86_64-linux-gnu/
    sudo chmod a+r /usr/lib/x86_64-linux-gnu/libcudnn*
    

    In this example I've installed cuDNN 7.0.x for CUDA 9.0 (x actually doesn't matter). Take care to match your CUDA version.

  3. Restart the computer. In my case the problem vanished. If the error still occurs consider installing another version of tensorflow.
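
To confirm which cuDNN version actually ended up installed after these steps, the version macros in the header can be read. A small sketch, assuming a cuDNN 7-style cudnn.h that defines CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL (newer cuDNN releases keep these macros in cudnn_version.h instead); the function name is made up:

```python
import re
from typing import Optional, Tuple


def cudnn_version_from_header(header_text: str) -> Optional[Tuple[int, int, int]]:
    """Extract (major, minor, patch) from the text of a cuDNN header."""
    parts = []
    for name in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        match = re.search(rf"^\s*#define\s+{name}\s+(\d+)", header_text, re.MULTILINE)
        if match is None:
            return None  # not a header that carries the version macros
        parts.append(int(match.group(1)))
    return parts[0], parts[1], parts[2]
```

For the install location used above, one would read /usr/include/cudnn.h and pass its contents to the function.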

Hope this helps someone.



Answer 6:

I also got the same error, and I resolved the issue. My system properties were as follows:

  • Operating System: Ubuntu 14.04
  • GPU: GTX 1050Ti
  • Nvidia Driver: 375.66
  • Tensorflow: 1.3.0
  • Cudnn: 6.0.21 (cudnn-8.0-linux-x64-v6.0.deb)
  • Cuda: 8.0.61
  • Keras: 2.0.8

How I solved the issue is as follows:

  1. I copied cudnn files to appropriate locations (/usr/local/cuda/include and /usr/local/cuda/lib64)
  2. I set the environment variables as:

    * export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda/lib64"
    * export CUDA_HOME=/usr/local/cuda
    
  3. I also ran the sudo ldconfig -v command to cache the shared libraries for the run-time linker.

I hope those steps will also help someone who is about to go crazy.
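
A quick way to double-check step 2 from inside Python is to look at the environment the interpreter actually sees. This is only a sketch: the helper name is hypothetical and it assumes a Linux-style, colon-separated LD_LIBRARY_PATH:

```python
import os
import posixpath
from typing import Mapping


def dir_on_library_path(directory: str, env: Mapping[str, str] = os.environ) -> bool:
    """True if `directory` is one of the entries in LD_LIBRARY_PATH."""
    wanted = posixpath.normpath(directory)
    entries = env.get("LD_LIBRARY_PATH", "").split(":")
    # Empty entries (from leading, trailing or doubled colons) are ignored.
    return any(posixpath.normpath(entry) == wanted for entry in entries if entry)
```

If dir_on_library_path("/usr/local/cuda/lib64") is False in the shell where you start training, the export probably went into a different shell or profile file.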



Answer 7:

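Adding the following code worked for me:

import tensorflow as tf  # TF 1.x API

config = tf.ConfigProto()
# Allocate GPU memory on demand instead of grabbing it all up front.
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)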

In my env there is no mismatch between CuDNN and Cuda versions. OS: ubuntu-18.04; Tensorflow: 1.14; CuDNN: 7.6; cuda: 10.1 (418.87.00).
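
If you cannot easily edit the code that creates the session, the same behaviour can reportedly be requested through the TF_FORCE_GPU_ALLOW_GROWTH environment variable. A minimal sketch, assuming your TensorFlow build honors this variable (older releases may simply ignore it):

```python
import os

# Set this before TensorFlow touches the GPU; doing it before the first
# `import tensorflow` in the process is the safest place.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"
```

The variable can also be exported in the shell that launches the script, which leaves the Python code untouched.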



Answer 8:

This is a cuDNN compatibility issue. Check which installed package is using the GPU, for instance tensorflow-gpu. What is its version? Is that version compatible with your cuDNN version, and is the installed cuDNN the right version for your CUDA?

I have observed the following pairings:

  • cuDNN v7.0.3 for Cuda 7.*
  • cuDNN v7.1.2 for Cuda 9.0
  • cuDNN v7.3.1 for Cuda 9.1
  • and so on.

Also check the correct version of TensorFlow for your CUDA configuration. For instance, using tensorflow-gpu:

  • TF v1.4 for cudnn 7.0.*
  • TF v1.7 and above for cudnn 9.0.*
  • etc.

So all you need to do is reinstall the appropriate cuDNN version. Hope it helps!



Answer 9:

For anyone getting this issue in Jupyter notebook:

I was running two jupyter notebooks. After closing one of them the issue was solved.



Answer 10:

I too encountered the same problem:

Using TensorFlow backend.
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties: 
name: GeForce GTX 1050
major: 6 minor: 1 memoryClockRate (GHz) 1.493 pciBusID 0000:01:00.0
Total memory: 3.95GiB
Free memory: 3.60GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1050, pci bus id: 0000:01:00.0)
E tensorflow/stream_executor/cuda/cuda_dnn.cc:385] could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
E tensorflow/stream_executor/cuda/cuda_dnn.cc:352] could not destroy cudnn handle: CUDNN_STATUS_BAD_PARAM
F tensorflow/core/kernels/conv_ops.cc:532] Check failed:  stream->parent()->GetConvolveAlgorithms(&algorithms)

Aborted (core dumped)

But in my case using sudo with the command worked perfectly fine.



Answer 11:

I encountered this problem when I accidentally installed the CUDA 9.2 package libcudnn7_7.2.1.38-1+cuda9.2_amd64.deb instead of libcudnn7_7.0.5.15-1+cuda9.0_amd64.deb on a system with CUDA 9.0 installed.

I got there because I had CUDA 9.2 installed and I had downgraded to CUDA 9.0, and evidently libcudnn is specific to versions.
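
Since the CUDA version is encoded in the package file name, a mix-up like this can be caught before installing. A rough sketch, assuming the libcudnnN_<version>-<rev>+cuda<X.Y>_<arch>.deb naming seen in the two file names above; both function names are hypothetical:

```python
import re
from typing import Optional, Tuple

_CUDNN_DEB = re.compile(
    r"libcudnn\d+(?:-dev|-doc)?_(?P<cudnn>[\d.]+)-\d+\+cuda(?P<cuda>\d+\.\d+)_"
)


def parse_cudnn_deb(filename: str) -> Optional[Tuple[str, str]]:
    """Return (cudnn_version, cuda_version) encoded in a libcudnn .deb name."""
    match = _CUDNN_DEB.search(filename)
    if match is None:
        return None
    return match.group("cudnn"), match.group("cuda")


def deb_matches_cuda(filename: str, installed_cuda: str) -> bool:
    """True if the package was built for the installed CUDA release (e.g. '9.0')."""
    parsed = parse_cudnn_deb(filename)
    return parsed is not None and parsed[1] == installed_cuda
```

With CUDA 9.0 installed, the first package from this answer fails the check and the second one passes.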



Answer 12:

For me, re-running the CUDA installation as described here solved the problem:

# Add NVIDIA package repository
sudo apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/7fa2af80.pub
wget http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_9.1.85-1_amd64.deb
sudo apt install ./cuda-repo-ubuntu1604_9.1.85-1_amd64.deb
wget http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1604/x86_64/nvidia-machine-learning-repo-ubuntu1604_1.0.0-1_amd64.deb
sudo apt install ./nvidia-machine-learning-repo-ubuntu1604_1.0.0-1_amd64.deb
sudo apt update

# Install CUDA and tools. Include optional NCCL 2.x
sudo apt install cuda9.0 cuda-cublas-9-0 cuda-cufft-9-0 cuda-curand-9-0 \
    cuda-cusolver-9-0 cuda-cusparse-9-0 libcudnn7=7.2.1.38-1+cuda9.0 \
    libnccl2=2.2.13-1+cuda9.0 cuda-command-line-tools-9-0

During the installation apt-get downgraded cudnn7 which I think is the culprit here. Probably it got updated accidentally with apt-get upgrade to a version which is incompatible with some other piece of the system.



Answer 13:

Please remember to close your TensorBoard terminal/cmd and any other terminals that interact with the directory. Then you can restart the training and it should work.



Answer 14:

It has to do with the fraction of GPU memory available for loading GPU resources and creating the cuDNN handle, known as per_process_gpu_memory_fraction. Reducing this memory fraction yourself will solve the error.

import tensorflow as tf  # TF 1.x API

sess_config = tf.ConfigProto(
    gpu_options=tf.GPUOptions(per_process_gpu_memory_fraction=0.7),
    allow_soft_placement=True)

with tf.Session(config=sess_config) as sess:
    sess.run([whatever])  # replace `whatever` with your own ops

Use as small a fraction as will fit your model in memory. (In the code I use 0.7; you can start with 0.3 or even smaller, then increase until you get the same error. That's your limit.) Pass it as config to your tf.Session(), tf.train.MonitoredTrainingSession(), or Supervisor's sv.managed_session().

This should allow your GPU to create a cuDNN handle for your TensorFlow code.
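
The "start small and increase until it fails" procedure can be written down as a simple search. This is only a sketch: works is a hypothetical callable you supply, and since the cuDNN failure aborts the whole process, it is assumed to launch each attempt in a separate process (for example with subprocess) and report whether that run exited cleanly:

```python
from typing import Callable, Iterable, Optional


def largest_working_fraction(works: Callable[[float], bool],
                             candidates: Iterable[float]) -> Optional[float]:
    """Try fractions in increasing order and return the last one that worked.

    Stops at the first failure, mirroring the manual procedure of raising
    the fraction until the error comes back.
    """
    best = None
    for fraction in sorted(candidates):
        if not works(fraction):
            break
        best = fraction
    return best
```

For example, largest_working_fraction(run_once, [0.3, 0.4, 0.5, 0.6, 0.7]) would report the highest value that still trained, where run_once is your own launcher.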