
Using both dynamically-allocated and statically-allocated shared memory

Posted 2019-02-20 22:48

Question:

Suppose I have two __device__ CUDA functions, each having the following local variable:

__shared__ int a[123];

and another function (say it's my kernel, i.e. a __global__ function), with:

extern __shared__ int b[];

Is this explicitly allowed/forbidden by nVIDIA? (I don't see it in section B.2.3 on __shared__ in the programming guide.) Do the sizes all count together towards the shared memory limit, or is the limit based on the maximum possibly in use at a single time? Or some other rule?

This can be considered a follow-up question to this one.

Answer 1:

The shared memory is split in two parts: statically allocated and dynamically allocated. The first part is computed at compile time, and each declaration is an actual allocation - enabling ptxas info during compilation illustrates it here:

  ptxas info    : Used 22 registers, 384 bytes smem, 48 bytes cmem[0]

Here, we have 384 bytes, which is 3 arrays of 32 ints (see the sample code below).

Since Kepler, you may pass a pointer to shared memory to another function, allowing a device sub-function to access a shared memory declaration made elsewhere.
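A minimal sketch of that pattern (my illustration, not from the original answer): the caller declares the __shared__ array and hands its address to a __device__ helper, which writes through it like any other pointer.

```cuda
// Sketch: a __device__ helper writing through a pointer that happens
// to point into the caller's __shared__ array (supported since Kepler / sm_30).
__device__ void fill(int *buf, int v)
{
    buf[threadIdx.x] = v;        // buf aliases shared memory here
}

__global__ void k()
{
    __shared__ int a[32];
    fill(a, 7);                  // pass the shared array's address down
    __syncthreads();
}
```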

Then comes the dynamically allocated shared memory, whose reserved size is declared at kernel launch.

Here is an example of various uses in a couple of functions. Note the pointer value of each shared memory region.

__device__ void dev1()
{
    __shared__ int a[32] ;
    a[threadIdx.x] = threadIdx.x ;

    if (threadIdx.x == 0)
        printf ("dev1 : %x\n", a) ;
}

__device__ void dev2()
{
    __shared__ int a[32] ;
    a[threadIdx.x] = threadIdx.x * 5 ;

    if (threadIdx.x == 0)
        printf ("dev2 : %x\n", a) ;
}

__global__ void kernel(int* res, int* res2)
{
    __shared__ int a[32] ;
    extern __shared__ int b[];

    a[threadIdx.x] = 0 ;
    b[threadIdx.x] = threadIdx.x * 3 ;

    dev1();
    __syncthreads();
    dev2();
    __syncthreads();

    res[threadIdx.x] = a[threadIdx.x] ;
    res2[threadIdx.x] = b[threadIdx.x] ;

    if (threadIdx.x == 0)
        printf ("global a : %x\n", a) ;
    if (threadIdx.x == 0)
        printf ("global b : %x\n", b) ;
}

int main()
{
    int* dres  ;
    int* dres2 ;

    cudaMalloc (&dres, 32*sizeof(int)) ;
    cudaMalloc (&dres2, 32*sizeof(int)) ;

    kernel<<<1,32,32*sizeof(int)>>> (dres, dres2);

    int hres[32] ;
    int hres2[32] ;

    cudaMemcpy (hres, dres, 32 * sizeof(int), cudaMemcpyDeviceToHost) ;
    cudaMemcpy (hres2, dres2, 32 * sizeof(int), cudaMemcpyDeviceToHost) ;

    for (int k = 0 ; k < 32 ; ++k)
    {
        printf ("%d -- %d \n", hres[k], hres2[k]) ;
    }
    return 0 ;
}

The ptxas info for this code reports 384 bytes of smem, that is one array for the kernel's a, a second for dev1's a, and a third for dev2's a, totalling 3*32*sizeof(int) = 384 bytes.

When running the kernel with dynamic shared memory equal to 32*sizeof(int), the pointer b starts right after those three arrays.
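The static total can also be queried at runtime with cudaFuncGetAttributes, whose sharedSizeBytes field reports a kernel's statically-allocated shared memory. A hedged host-side sketch, assuming the kernel from the listing above is in scope (my addition, not in the original answer):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Assumes the __global__ 'kernel' from the listing above is declared here.
int main()
{
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, kernel);
    // sharedSizeBytes should match the ptxas "smem" figure (384 in this case).
    printf("static smem: %zu bytes\n", (size_t)attr.sharedSizeBytes);
    return 0;
}
```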

EDIT: The PTX file generated from this code holds declarations of the statically-defined shared memory,

.shared .align 4 .b8 _ZZ4dev1vE1a[128];
.shared .align 4 .b8 _ZZ4dev2vE1a[128];
.extern .shared .align 4 .b8 b[];

except for the entry point, where it is defined in the body of the method:

// _ZZ6kernelPiS_E1a has been demoted

The shared state space is defined in the PTX documentation here:

The shared (.shared) state space is a per-CTA region of memory for threads in a CTA to share data. An address in shared memory can be read and written by any thread in a CTA. Use ld.shared and st.shared to access shared variables.

There is, however, no detail on the runtime behavior. The programming guide mentions it briefly here, with no further detail on mixing the two.

During PTX compilation, the compiler knows the amount of shared memory that is statically allocated. There might be some supplemental magic. Looking at the SASS, the first instructions use SR_LMEMHIOFF:

IADD32I R1, R1, -0x8;
S2R R0, SR_LMEMHIOFF;
ISETP.GE.U32.AND P0, PT, R1, R0, PT;

and calling the functions in reverse order assigns different addresses to the statically-allocated shared memory (which looks very much like a form of stackalloc).

I believe the ptxas compiler calculates all the shared memory it might need in the worst case, when all methods may be called (when one of the methods is not used and the call goes through function pointers, the address of b does not change, and the unallocated shared memory region is simply never accessed).
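That worst-case behavior can be probed with a sketch along these lines (hypothetical, building on dev1/dev2 from the listing above): selecting the callee through a __device__ function pointer hides the choice from ptxas, yet b should land at the same offset either way.

```cuda
// Hypothetical probe: call dev1 or dev2 through a function pointer so the
// compiler cannot tell which one runs, then print where b was placed.
typedef void (*devfn_t)();

__device__ devfn_t pick(int which)
{
    return which ? dev2 : dev1;
}

__global__ void kernel_fp(int which)
{
    extern __shared__ int b[];
    devfn_t f = pick(which);
    f();
    if (threadIdx.x == 0)
        printf("b : %p\n", (void*)b);   // same address regardless of 'which'
}
```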

Finally, as einpoklum suggests in a comment, this is experimental behavior and not part of a norm/API definition.