I'm trying to compile a program that includes a kernel, with MSVS 2012 and CUDA. I use shared memory, but unlike in this question regarding the same problem, I only use my variable name for this kernel's shared memory once, so there's no issue of redefinition. With code like this:
template <typename T>
__global__ void mykernel(
    const T* __restrict__ data,
    T*       __restrict__ results)
{
    extern __shared__ T warp_partial_results[];
    /* ... */
    warp_partial_results[lane_id] = something;
    /* ... */
    results[something_else] = warp_partial_results[something_else];
    /* ... */
}
which is instantiated for several types (e.g. float, int, unsigned int), I get the dreaded
declaration is incompatible with previous "warp_partial_results"
message. What could cause this?
CUDA doesn't directly 'support' dynamically-allocated shared memory arrays in templated functions, because the compiler (apparently) emits an actual definition for each such extern array. If you instantiate a templated function for multiple types, these definitions all use the same symbol name but different element types, so they conflict.
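Concretely, instantiating the asker's kernel for two types effectively leaves the compiler with two incompatible declarations of the same symbol (an illustrative reconstruction, not actual nvcc output):

```cuda
// What instantiating mykernel<float> and mykernel<int> effectively
// produces for the shared array (illustrative):
extern __shared__ float warp_partial_results[]; // from mykernel<float>
extern __shared__ int   warp_partial_results[]; // from mykernel<int>
// -> "declaration is incompatible with previous warp_partial_results"
```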
A workaround is available in the form of template specialization via classes. See:
http://www.ecse.rpi.edu/~wrf/wiki/ParallelComputingSpring2015/cuda/nvidia/samples/0_Simple/simpleTemplates/sharedmem.cuh
You use the workaround as follows:
template <class T>
__global__ void foo(T* g_idata, T* g_odata)
{
    // shared memory
    // the size is determined by the host application
    SharedMem<T> shared;
    T* sdata = shared.getPointer();
    // ... the rest of the code remains unchanged!
}
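For context, the "size is determined by the host application" comment refers to the third launch-configuration parameter, which sets the dynamic shared memory size in bytes. A minimal launch might look like this (block size and the device pointer names d_idata, d_odata are illustrative):

```cuda
// Host side: the third <<<...>>> argument is the number of bytes of
// dynamic shared memory to allocate per block.
int n = 256;  // illustrative block size
foo<float><<<1, n, n * sizeof(float)>>>(d_idata, d_odata);
```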
getPointer() has* a specialized implementation for every type, each of which returns a pointer to a differently named shared array, e.g. extern __shared__ float shared_mem_float[] or extern __shared__ int shared_mem_int[] etc.
(*) - Not really for every type: in NVIDIA's supplied header file they specialize for some basic types, and that's that.
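The specializations in that header look roughly like this (a sketch of the idea, not a verbatim copy of sharedmem.cuh; the struct and array names here are illustrative):

```cuda
template <typename T> struct SharedMem;  // primary template left undefined

template <> struct SharedMem<float> {
    __device__ float* getPointer() {
        extern __shared__ float smem_float[];  // uniquely named symbol per type
        return smem_float;
    }
};

template <> struct SharedMem<int> {
    __device__ int* getPointer() {
        extern __shared__ int smem_int[];
        return smem_int;
    }
};
```

Because each specialization declares its own distinctly named, correctly typed extern array, instantiating the kernel for several types no longer produces conflicting definitions of a single symbol.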