Consider the following Python code:
from numpy import float64
from pycuda import compiler, gpuarray
import pycuda.autoinit
# The launch only fails for N > 960.
N = 961
code = """
__global__ void kern(double *v)
{
    double a = v[0]*v[2];
    double lmax = fmax(0.0, a), lmin = fmax(0.0, -a);
    double smax = sqrt(lmax), smin = sqrt(lmin);
    if(smax > 0.2) {
        smax = fmin(smax, 0.2)/smax;
        smin = (smin > 0.0) ? fmin(smin, 0.2)/smin : 0.0;
        smin = lmin + smin*a;
        v[0] = v[0]*smin + smax*lmax;
        v[2] = v[2]*smin + smax*lmax;
    }
}
"""
kernel_func = compiler.SourceModule(code).get_function("kern")
kernel_func(gpuarray.zeros(3, float64), block=(N,1,1))
Executing this gives:
Traceback (most recent call last):
  File "test.py", line 25, in <module>
    kernel_func(gpuarray.zeros(3, float64), block=(N,1,1))
  File "/usr/lib/python3.5/site-packages/pycuda/driver.py", line 402, in function_call
    func._launch_kernel(grid, block, arg_buf, shared, None)
pycuda._driver.LaunchError: cuLaunchKernel failed: too many resources requested for launch
My setup: Python v3.5.2 with pycuda==2016.1.2 and numpy==1.11.1 on Ubuntu 16.04.1 (64-bit), kernel 4.4.0, nvcc V7.5.17. The graphics card is an Nvidia GeForce GTX 480.
Can you reproduce this on your machine? Do you have any idea what causes this error message?
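One observation that might help narrow this down: 960 is exactly 30 warps of 32 threads, so a block of 961 threads needs a 31st warp. The following is only a rough sketch of the resulting per-thread register budget. It assumes (I have not confirmed this) that the limiting resource is registers, that the device allows 32768 32-bit registers per block (the figure I know for compute capability 2.0 cards such as the GTX 480), and it ignores the hardware's register allocation granularity. The helper names `warps_needed` and `max_regs_per_thread` are made up for illustration.

```python
import math

WARP_SIZE = 32
# Assumption: per-block register limit of a compute capability 2.0 device.
REGS_PER_BLOCK = 32768


def warps_needed(threads):
    """Number of warps a block of `threads` threads occupies."""
    return math.ceil(threads / WARP_SIZE)


def max_regs_per_thread(threads, regs_per_block=REGS_PER_BLOCK):
    """Largest per-thread register count that still fits in one block.

    Registers are counted for whole warps, so the thread count is
    rounded up to a multiple of the warp size first.
    """
    return regs_per_block // (warps_needed(threads) * WARP_SIZE)


for n in (960, 961):
    print(n, warps_needed(n), max_regs_per_thread(n))
# 960 30 34
# 961 31 33
```

Under these assumptions, a kernel compiled to 34 registers per thread would launch with 960 threads but not with 961. As far as I know, PyCUDA exposes the compiled register count as `kernel_func.num_regs`, which could be compared against this budget.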
Remark: I know that, in principle, there is a race condition because all threads try to write v[0] and v[2]. But the threads shouldn't reach the inside of the if-block anyway: the input is all zeros, so smax is 0. Moreover, I'm able to reproduce the error without the race condition, but that example is much more complicated.
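To back up the claim that the if-block is never entered for this input, here is a plain-Python mirror of the kernel body. It is a sketch, not the GPU code itself: `kern_host` is a hypothetical name, and I am assuming that Python's `max`, `min` and `math.sqrt` on floats behave like the CUDA `fmax`, `fmin` and `sqrt` for these values.

```python
import math


def kern_host(v):
    """Host-side mirror of the kernel body.

    Modifies `v` in place like the kernel does and returns True
    if the if-block was entered, False otherwise.
    """
    a = v[0] * v[2]
    lmax, lmin = max(0.0, a), max(0.0, -a)
    smax, smin = math.sqrt(lmax), math.sqrt(lmin)
    if smax > 0.2:
        smax = min(smax, 0.2) / smax
        smin = min(smin, 0.2) / smin if smin > 0.0 else 0.0
        smin = lmin + smin * a
        v[0] = v[0] * smin + smax * lmax
        v[2] = v[2] * smin + smax * lmax
        return True
    return False


v = [0.0, 0.0, 0.0]       # same contents as gpuarray.zeros(3, float64)
entered = kern_host(v)
print(entered, v)
```

For the all-zero input, `a`, `lmax` and `smax` are all zero, so the comparison `smax > 0.2` is false and `v` is left untouched. That suggests the failure happens at launch time and does not depend on what the threads would compute.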