Suppose I have two __device__
CUDA function, each having the following local variable:
__shared__ int a[123];
and another function (say it's my kernel, i.e. a __global__
function), with:
extern __shared__ int b[];
Is this explicitly allowed/forbidden by nVIDIA? (I don't see it in the programming guide section B.2.3 on __shared__
) Do the sizes all count together together towards the shared memory limit, or is it the maximum possibly in use at a single time? Or some other rule?
This can be considered a follow-up question to this one.
The shared memory is split in two parts: statically allocated and dynamically allocated. The first part is calculated during compilation, and each declaration is an actual allocation - activating ptxas info during compilation illustrates it here:
Here, we have
384
bytes, which is3
arrays of32
ints. (see sample corde below).You may pass a pointer to shared memory since Kepler, to another function allowing a device sub-function to access another shared memory declaration.
Then, comes the dynamically allocated shared memory, which reserved size is declared during kernel call.
Here is an example of some various uses in a couple of functions. Note the pointer value of each shared memory region.
This code outputs the ptxas info using
384 bytes smem
, that is one array for globala
array, a second for dev1 methoda
array, and a third for dev2 methoda
array. Totalling3*32*sizeof(float)=384 bytes
.When running the kernel with dynamic shared memory equals to
32*sizeof(float)
, the pointer tob
starts right after these three arrays.EDIT: The ptx file generated by this code holds declarations of statically-defined shared memory,
except for the entry-point where it is defined in the body of the method
The shared space of the memory is defined in the PTX documentation here:
Though with no detail on the runtime. There is a word in the programming guide here with no further detail on the mixing of the two.
During PTX compilation, the compiler may know the amount of shared memory that is statically allocated. There might be some supplemental magic. Looking at the SASS, the first instructions use the SR_LMEMHIOFF
and calling functions in reverse order assign different values to the statically-allocated shared memory (looks very much like a form of stackalloc).
I believe the ptxas compiler calculates all the shared memory it might need in the worst case when all method may be called (when not using one of the method and using function pointers, the
b
address does not change, and the unallocated shared memory region is never accessed).Finally, as einpoklum suggests in a comment, this is experimental and not part of a norm/API definition.