Verifying if GPU is actually used in Keras/TensorFlow

I've just built a deep learning rig (AMD 12-core Threadripper; GeForce RTX 2080 Ti; 64 GB RAM). I originally wanted to install cuDNN and CUDA on Ubuntu 19.0, but the installation was too painful, and after reading around a bit I decided to switch to Windows 10...

After several installs of tensorflow-gpu, inside and outside conda environments, I ran into further issues, which I assumed were down to cuDNN/CUDA/TensorFlow compatibility, so I uninstalled various versions of CUDA and TF. My output from nvcc --version:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:04_Central_Daylight_Time_2018
Cuda compilation tools, release 10.0, V10.0.130

nvidia-smi, however, shows CUDA==11.0?!

I also have:

import keras
import tensorflow as tf
from keras import backend
from tensorflow.python.platform import build_info as tf_build_info

# Report whether TensorFlow can see a GPU, plus the versions in use
if tf.test.gpu_device_name():
    print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))
else:
    print("Please install GPU version of TF")
print("keras version: {0} | Backend used: {1}".format(keras.__version__, backend.backend()))
print("tensorflow version: {0} | Backend used: {1}".format(tf.__version__, backend.backend()))
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
print("CUDA: {0} | CUDnn: {1}".format(tf_build_info.cuda_version_number, tf_build_info.cudnn_version_number))

with output:

My device: [name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 12853915229880452239
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 9104897474
locality {
  bus_id: 1
  links {
  }
}
incarnation: 7328135816345461398
physical_device_desc: "device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:42:00.0, compute capability: 7.5"
]
Default GPU Device: /device:GPU:0
keras version: 2.3.1 | Backend used: tensorflow
tensorflow version: 2.1.0 | Backend used: tensorflow
Num GPUs Available:  1
CUDA: 10.1 | CUDnn: 7

So (I hope) my installation has at least partly worked. I still don't know whether the GPU is actually being used for my training, or whether it is merely recognised while the CPU does the work. How can I tell the difference?

I also use PyCharm. There was a recommendation to install Visual Studio, with an additional step:

5. Include cudnn.lib in your Visual Studio project.
Open the Visual Studio project and right-click on the project name.
Click Linker > Input > Additional Dependencies.
Add cudnn.lib and click OK.

I didn't do this step. I also read that I need to set the following in environment variables, but my directory is empty:

SET PATH=C:\tools\cuda\bin;%PATH%

Could anyone verify this?

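Also, one of my Keras models requires a hyperparameter search:

from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(estimator=model,
                    param_grid=param_grid,  # hyperparameter dict, defined elsewhere
                    n_jobs=-1)  # -1 for all cores

grid_result = grid.fit(X, Y)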

This works fine on my MBP (assuming, of course, that n_jobs=-1 uses all CPU cores). On my DL rig, I get the following errors and warnings:

ERROR: The process with PID 5156 (child process of PID 1184) could not be terminated.
Reason: Access is denied.
ERROR: The process with PID 1184 (child process of PID 6920) could not be terminated.
Reason: There is no running instance of the task.
2020-03-28 20:29:48.598918: E tensorflow/stream_executor/cuda/] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2020-03-28 20:29:48.599348: E tensorflow/stream_executor/cuda/] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2020-03-28 20:29:48.599655: E tensorflow/stream_executor/cuda/] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2020-03-28 20:29:48.603023: E tensorflow/stream_executor/cuda/] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2020-03-28 20:29:48.603649: E tensorflow/stream_executor/cuda/] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2020-03-28 20:29:48.604236: E tensorflow/stream_executor/cuda/] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2020-03-28 20:29:48.604773: E tensorflow/stream_executor/cuda/] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2020-03-28 20:29:48.605524: E tensorflow/stream_executor/cuda/] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2020-03-28 20:29:48.608151: E tensorflow/stream_executor/cuda/] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2020-03-28 20:29:48.608369: W tensorflow/stream_executor/] attempting to perform BLAS operation using StreamExecutor without BLAS support
2020-03-28 20:29:48.608559: W tensorflow/core/common_runtime/] BaseCollectiveExecutor::StartAbort Internal: Blas GEMM launch failed : a.shape=(10, 8), b.shape=(8, 4), m=10, n=4, k=8
     [[{{node dense_1/MatMul}}]]
C:\Users\me\PycharmProjects\untitled\venv\lib\site-packages\sklearn\model_selection\ FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details: 
tensorflow.python.framework.errors_impl.InternalError:  Blas GEMM launch failed : a.shape=(10, 8), b.shape=(8, 4), m=10, n=4, k=8
     [[node dense_1/MatMul (defined at C:\Users\me\PycharmProjects\untitled\venv\lib\site-packages\keras\backend\ ]] [Op:__inference_keras_scratch_graph_982]

Can I assume that GridSearchCV utilises only the CPU and not the GPU? Also, when timing another method in my code, the MBP (2.8 GHz Intel Core i7) takes approx. 40 s, while the desktop (12-core Threadripper) takes approx. 43 s. Even comparing the CPUs alone, I'd expect the desktop to be far quicker than the MBP. Is my assumption wrong?


The following details are based on the TensorFlow documentation:

If a TensorFlow operation has both CPU and GPU implementations, 
by default, the GPU devices will be given priority when the operation is assigned to a device.
For example, tf.matmul has both CPU and GPU kernels. 
On a system with devices CPU:0 and GPU:0, the GPU:0 device will be selected to run tf.matmul unless you explicitly request running it on another device.

Logging device placement


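import tensorflow as tf
tf.debugging.set_log_device_placement(True)  # log the device used by each op

# Create some tensors
a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
c = tf.matmul(a, b)
print(c)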

Example Result
Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0
tf.Tensor(
[[22. 28.]
 [49. 64.]], shape=(2, 2), dtype=float32)
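
With a real training run the placement log can grow to many lines. Below is a minimal sketch for tallying them, assuming the lines keep the "Executing op <name> in device <device>" shape of the example result and have already been captured into a string. The function name count_ops_per_device is hypothetical (it is not part of TensorFlow), and the Cast line in the sample is made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as:
#   Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0
_PLACEMENT = re.compile(r"Executing op (\S+) in device \S*device:(\w+):(\d+)")


def count_ops_per_device(log_text):
    """Count logged ops per device, keyed by strings like 'GPU:0'."""
    counts = Counter()
    for match in _PLACEMENT.finditer(log_text):
        _op_name, device_type, index = match.groups()
        counts[f"{device_type}:{index}"] += 1
    return counts


# Sample log text; the Cast line is invented to show a CPU placement.
sample_log = """\
Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0
Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0
Executing op Cast in device /job:localhost/replica:0/task:0/device:CPU:0
"""
print(count_ops_per_device(sample_log))
```

If nearly every op lands on CPU:0 even though a GPU is listed, that suggests the GPU is recognised but not doing the work.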

For Manual Device placement


tf.debugging.set_log_device_placement(True)

# Place tensors on the GPU
with tf.device('/GPU:0'):
  a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
  b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

c = tf.matmul(a, b)
print(c)

Example Result:
Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0
tf.Tensor(
[[22. 28.]
 [49. 64.]], shape=(2, 2), dtype=float32)


Another way to check whether the GPU is being used, which I ended up finding (for Windows users), is to open the "Task Manager" and change one of the monitors in the "Performance" tab to CUDA, then simply run the script and watch it spike.
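
The same reading can probably be taken from a script by polling nvidia-smi while training runs. This is only a sketch: it assumes nvidia-smi is on the PATH and that the queried fields come back as plain integers (some GPUs report values such as [N/A], which this parser does not handle). The function names parse_gpu_stats and read_gpu_stats are hypothetical.

```python
import shutil
import subprocess

# Ask nvidia-smi for utilisation and memory as bare CSV numbers (memory in MiB).
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(csv_text):
    """Parse nvidia-smi CSV rows into a list of dicts, one per GPU."""
    stats = []
    for line in csv_text.strip().splitlines():
        util, used, total = (int(field) for field in line.split(","))
        stats.append(
            {"util_percent": util, "mem_used_mib": used, "mem_total_mib": total}
        )
    return stats


def read_gpu_stats():
    """Return current GPU stats, or None if nvidia-smi is not on the PATH."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_stats(result.stdout)


# Call read_gpu_stats() from a second terminal or thread, e.g. once per second,
# while the training script is running.
```

A utilisation figure that stays near zero during model.fit would point to the CPU doing the work.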

Adding this

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'  # hide all GPUs from TensorFlow

before the keras import toggles between CPU and GPU and also shows a remarkable difference (although for my simple network, the quicker CPU can be explained here).
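
To make that comparison repeatable, the toggle and the timing can be wrapped in small helpers. This is a rough sketch: select_device and timed are hypothetical names, the summation is only a placeholder for the real model.fit(...) call, and it assumes that removing CUDA_VISIBLE_DEVICES restores the default of exposing every GPU.

```python
import os
import time
from contextlib import contextmanager


def select_device(use_gpu):
    """Hide or expose CUDA devices; call BEFORE importing keras/tensorflow."""
    if use_gpu:
        os.environ.pop("CUDA_VISIBLE_DEVICES", None)  # default: all GPUs visible
    else:
        os.environ["CUDA_VISIBLE_DEVICES"] = "-1"  # hide every GPU


@contextmanager
def timed(label, results):
    """Store the wall-clock seconds of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


results = {}
select_device(use_gpu=False)
# import keras  # the real import must come after select_device()
with timed("cpu", results):
    sum(i * i for i in range(100_000))  # placeholder for model.fit(...)
print(results)
```

Because the variable is read when TensorFlow initialises, each setting needs its own fresh Python process; flipping it after the import has no effect.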