PyCUDA either fails to find function in NVIDIA sou

I am trying to use (and learn from) Mark Harris's optimized reduction kernel, by copying his source code into a simple pycuda application (full source of my attempt is listed below).

Unfortunately, I run into one of the two following erros.

The cuda kernel does not compile, throwing the following error message.

kernel.cu(3): error: this declaration may not have extern "C" linkage

If I include the argument no_extern_c=True into the line that compiles the kernel, the following error is raised:
```
pycuda._driver.LogicError: cuModuleGetFunction failed: not found
```

I have also tried wrapping the contents of modStr in extern "C" { [...] } with the no_extern_c variable set to either True or False, without any success.

The problem appears to involve the line template <unsigned int blockSize> as if I comment the body of the function out it still raises errors. But I don't understand the problem well enough to have any more ideas about how to fix it.

Any advice / suggestions / help would be much appreciated -- thanks in advance!

from pylab import *

import pycuda.gpuarray as gpuarray
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule

modStr = """
template <unsigned int blockSize>
__global__ void reduce6(int *g_idata, int *g_odata, unsigned int n) {
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*(blockSize*2) + tid;
    unsigned int gridSize = blockSize*2*gridDim.x;
    sdata[tid] = 0;
    while (i < n) { 
        sdata[tid] += g_idata[i] + g_idata[i+blockSize]; i += gridSize; 
    }
    __syncthreads();
    if (blockSize >= 512) { if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads(); }
    if (blockSize >= 256) { if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads(); }
    if (blockSize >= 128) { if (tid < 64) { sdata[tid] += sdata[tid + 64]; } __syncthreads(); }
    if (tid < 32) {
        if (blockSize >= 64) sdata[tid] += sdata[tid + 32];
        if (blockSize >= 32) sdata[tid] += sdata[tid + 16];
        if (blockSize >= 16) sdata[tid] += sdata[tid + 8];
        if (blockSize >= 8) sdata[tid] += sdata[tid + 4];
        if (blockSize >= 4) sdata[tid] += sdata[tid + 2];
        if (blockSize >= 2) sdata[tid] += sdata[tid + 1];
    }
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
"""

mod = SourceModule(modStr,no_extern_c=True) 
# With no_extern_c = True, the error is :
# pycuda._driver.LogicError: cuModuleGetFunction failed: not found
# With no_extern_c = False, the error is :
# kernel.cu(3): error: this declaration may not have extern "C" linkage


cuda_reduce_fn = mod.get_function("reduce6")
iData = arange(32).astype(np.float32)
oData = zeros_like(iData)

cuda_reduce_fn(
    drv.In(iData),
    drv.Out(oData),
    np.int32(32),
    block=(32,1,1), grid=(1,1))

print(iData)
print(oData)

It is illegal to have templated functions with C linkage in C++, which is why you get the error in the first case.

In the second case, you get a not found error because you haven't actually instantiated the template anywhere I can see, so the compiler won't emit any output.

When you do add an instance, you will get the same error, because the compiled code object for the device has a mangled name. You will need to use the mangled name in the get_function call. Paradoxically, you can't know the mangled name when JIT compiling from source, because you need to see the compiler output and that isn't know a priori (any of compiler messages, PTX, cubin or object files will give you the mangled name).

If you want to work with templated kernels in PyCUDA, I recommend compiling them to cubin yourself with the toolchain, and then loading from cubin in PyCUDA to get known mangled names from the module.