CUDA runtime error (59) : device-side assert trigg

2020-02-28 03:10发布

问题:

I have access to Tesla K20c, I am running ResNet50 on CIFAR10 dataset... Then I get the error as:
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1524584710464/work/aten/src/THC/generated/../generic/THCTensorMathPointwise.cu line=265 error=59 : device-side assert triggered
Traceback (most recent call last):
File "main.py", line 109, in <module>
train(loader_train, model, criterion, optimizer)
File "main.py", line 54, in train optimizer.step()
File "/usr/local/anaconda35/lib/python3.6/site-packages/torch/optim/sgd.py", line 93, in step
d_p.add_(weight_decay, p.data) RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1524584710464/work/aten/src/THC/generated/../generic/THCTensorMathPointwise.cu:265
How to resolve this error

回答1:

In general, when encountering cuda runtine errors, it is advisable to run your program again using the CUDA_LAUNCH_BLOCKING=1 flag to obtain an accurate stack trace.

In your specific case, the targets of your data were too high (or low) for the specified number of classes.



回答2:

I have encountered this problem several times. And I find it to be an index issue. For example, if your ground truth label starts at 1: target = [1,2,3,4,5], then you should subtract 1 for every label, change it to: [0,1,2,3,4]. This solves my problem every time.



标签: gpu pytorch