I am trying to use (and learn from) Mark Harris's optimized reduction kernel, by copying his source code into a simple pycuda application (full source of my attempt is listed below).
Unfortunately, I run into one of the two following erros.
The cuda kernel does not compile, throwing the following error message.
kernel.cu(3): error: this declaration may not have extern "C" linkage
If I include the argument
no_extern_c=True
into the line that compiles the kernel, the following error is raised:pycuda._driver.LogicError: cuModuleGetFunction failed: not found
I have also tried wrapping the contents of modStr in extern "C" { [...] }
with the no_extern_c
variable set to either True or False, without any success.
The problem appears to involve the line template <unsigned int blockSize>
as if I comment the body of the function out it still raises errors. But I don't understand the problem well enough to have any more ideas about how to fix it.
Any advice / suggestions / help would be much appreciated -- thanks in advance!
from pylab import *
import pycuda.gpuarray as gpuarray
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule
modStr = """
template <unsigned int blockSize>
__global__ void reduce6(int *g_idata, int *g_odata, unsigned int n) {
extern __shared__ int sdata[];
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*(blockSize*2) + tid;
unsigned int gridSize = blockSize*2*gridDim.x;
sdata[tid] = 0;
while (i < n) {
sdata[tid] += g_idata[i] + g_idata[i+blockSize]; i += gridSize;
}
__syncthreads();
if (blockSize >= 512) { if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads(); }
if (blockSize >= 256) { if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads(); }
if (blockSize >= 128) { if (tid < 64) { sdata[tid] += sdata[tid + 64]; } __syncthreads(); }
if (tid < 32) {
if (blockSize >= 64) sdata[tid] += sdata[tid + 32];
if (blockSize >= 32) sdata[tid] += sdata[tid + 16];
if (blockSize >= 16) sdata[tid] += sdata[tid + 8];
if (blockSize >= 8) sdata[tid] += sdata[tid + 4];
if (blockSize >= 4) sdata[tid] += sdata[tid + 2];
if (blockSize >= 2) sdata[tid] += sdata[tid + 1];
}
if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
"""
mod = SourceModule(modStr,no_extern_c=True)
# With no_extern_c = True, the error is :
# pycuda._driver.LogicError: cuModuleGetFunction failed: not found
# With no_extern_c = False, the error is :
# kernel.cu(3): error: this declaration may not have extern "C" linkage
cuda_reduce_fn = mod.get_function("reduce6")
iData = arange(32).astype(np.float32)
oData = zeros_like(iData)
cuda_reduce_fn(
drv.In(iData),
drv.Out(oData),
np.int32(32),
block=(32,1,1), grid=(1,1))
print(iData)
print(oData)