Google Cloud AI Platform Notebook Instance won'

Posted 2019-09-20 09:16

I'm using the pre-built AI Platform Jupyter Notebook instances to train a model with a single Tesla K80 card. The issue is that I don't believe the model is actually training on the GPU.

nvidia-smi returns the following during training:

No Running Processes Found

Note the "No Running Processes Found" message, yet "Volatile GPU Usage" is at 100%. Something seems strange...

...And the training is excruciatingly slow.

A few days ago, I was having issues with the GPU not being released after each notebook run. When this occurred, I would receive an OOM (out of memory) error. Each time, I had to go into the console, find the PID of the process holding the GPU, and run kill -9 on it before re-running the notebook. Today, however, I can't get the GPU to run at all. nvidia-smi never shows a running process.

I've tried 2 different GCP AI Platform Notebook instances (both of the available TensorFlow version options) with no luck. Am I missing something with these "pre-built" instances?

[Screenshot: Pre-Built AI Platform Notebook section]

Just to clarify, I did not build my own instance and then install access to Jupyter notebooks. Instead, I used the built-in Notebook instance option under the AI Platform submenu.

Do I still need to configure a setting somewhere or install a library to continue using/reset my chosen GPU? I was under the impression that the virtual machine was already loaded with the Nvidia stack and should be plug and play with GPUs.

Thoughts?

EDIT: Here is a full video of the issue as requested --> https://www.youtube.com/watch?v=N5Zx_ZrrtKE&feature=youtu.be

1 answer
戒情不戒烟
#2 · 2019-09-20 09:31

Generally speaking, you'll want to try to debug issues like this using the smallest possible bit of code that could reproduce your error. That removes many possible causes for the issue you're seeing.

In this case, you can check if your GPUs are being used by running this code (copied from the TensorFlow 2.0 GPU instructions):

import tensorflow as tf
print("GPU Available: ", tf.test.is_gpu_available())

tf.debugging.set_log_device_placement(True)

# Create some tensors
a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
c = tf.matmul(a, b)

print(c)

Running it on the same TF 2.0 Notebook gives me the output:

GPU Available:  True
Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0
tf.Tensor(
[[22. 28.]
 [49. 64.]], shape=(2, 2), dtype=float32)

The "device:GPU:0" in the MatMul line shows that the op ran on the GPU.

Similarly, if you need more evidence, running nvidia-smi gives the output:

jupyter@tf2:~$ nvidia-smi
Tue Jul 30 00:59:58 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.104      Driver Version: 410.104      CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   36C    P0    58W / 149W |  10900MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      7852      C   /usr/bin/python3                           10887MiB |
+-----------------------------------------------------------------------------+
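If you would rather check the process table from a notebook cell than read the full nvidia-smi table, something like the following might help. It is only a sketch: the helper names (parse_compute_apps, gpu_processes) are made up for illustration, and it assumes nvidia-smi is on the PATH and supports the --query-compute-apps CSV query. If nvidia-smi is missing, it simply returns an empty list.

```python
import shutil
import subprocess

# Ask nvidia-smi for one CSV row per compute process: pid, name, memory used.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader",
]


def parse_compute_apps(csv_text):
    """Turn 'pid, name, memory' rows into a list of dicts (hypothetical helper)."""
    rows = []
    for line in csv_text.splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 3 or not parts[0].isdigit():
            continue  # skip blank lines and anything that is not a process row
        rows.append({"pid": int(parts[0]), "name": parts[1], "memory": parts[2]})
    return rows


def gpu_processes():
    """Return the compute processes nvidia-smi reports, or [] if it is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True, check=False)
    return parse_compute_apps(result.stdout)


if __name__ == "__main__":
    for proc in gpu_processes():
        print(proc)
```

An empty list while training is running would match the "No Running Processes Found" symptom from the question. A stale entry left over from an earlier run would be the PID to kill before re-running the notebook.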

So why isn't your code using GPUs? You're using a library someone else wrote, probably for tutorial purposes. Most likely those library functions are doing something that is causing CPUs to be used instead of GPUs.

You'll want to debug that code directly.
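One way to start is to turn on tf.debugging.set_log_device_placement(True) before calling the library, then look at which device each op lands on. With a real model the log gets long, so a small summary can help. This is a rough sketch using only the standard library: summarize_placement is a hypothetical helper, and it assumes you have the log text as a string (for example, pasted from the cell output) in the same "Executing op ... in device ..." format as the output shown earlier.

```python
import re
from collections import Counter

# Matches lines such as:
#   Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0
PLACEMENT = re.compile(r"Executing op (\S+) in device \S*/device:(\w+):(\d+)")


def summarize_placement(log_text):
    """Count how often each op ran on each device (hypothetical helper)."""
    counts = Counter()
    for match in PLACEMENT.finditer(log_text):
        op, device_type, index = match.groups()
        counts[(f"{device_type}:{index}", op)] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "Executing op MatMul in device "
        "/job:localhost/replica:0/task:0/device:GPU:0\n"
    )
    for (device, op), n in sorted(summarize_placement(sample).items()):
        print(device, op, n)
```

If the heavy ops (MatMul, Conv2D and so on) show up mostly under CPU:0, that points at the library code pinning work to the CPU rather than at the notebook instance.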
