CUDA - Memory Limit - Vector Summation

Posted 2019-06-09 01:54

I'm trying to learn CUDA. The following code works for values of N <= 16384 but fails for larger values: the summation check at the end of the code fails, and the c values are always 0 for indices i >= 16384.

#include<iostream>
#include"cuda_runtime.h"
#include"../cuda_be/book.h"

#define N (16384)

__global__ void add(int *a,int *b,int *c)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if(tid<N)
    {
        c[tid] = a[tid] + b[tid];
        tid += blockDim.x * gridDim.x;
    }
}

int main()
{
    int a[N],b[N],c[N];
    int *dev_a,*dev_b,*dev_c;

    //allocate mem on gpu
    HANDLE_ERROR(cudaMalloc((void**)&dev_a,N*sizeof(int)));
    HANDLE_ERROR(cudaMalloc((void**)&dev_b,N*sizeof(int)));
    HANDLE_ERROR(cudaMalloc((void**)&dev_c,N*sizeof(int)));

    for(int i=0;i<N;i++)
    {
        a[i] = -i;
        b[i] = i*i;
    }

    HANDLE_ERROR(cudaMemcpy(dev_a,a,N*sizeof(int),cudaMemcpyHostToDevice));
    HANDLE_ERROR(cudaMemcpy(dev_b,b,N*sizeof(int),cudaMemcpyHostToDevice));
    system("PAUSE");
    add<<<128,128>>>(dev_a,dev_b,dev_c);

    //copy the array 'c' back from the gpu to the cpu

    HANDLE_ERROR( cudaMemcpy(c,dev_c,N*sizeof(int),cudaMemcpyDeviceToHost));

    system("PAUSE");
    bool success = true;
    for(int i=0;i<N;i++)
    {
        if((a[i] + b[i]) != c[i])
        {
            printf("Error in %d: %d + %d != %d\n",i,a[i],b[i],c[i]);
            system("PAUSE");
            success = false;
        }

    }

    if(success) printf("We did it!\n");

    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);

    return 0;
}

I think it's a shared-memory-related problem, but I can't come up with a good explanation (possibly a lack of knowledge). Could you provide an explanation and a workaround so it runs for values of N greater than 16384? Here are the specs for my GPU:

General Info for device 0
Name: GeForce 9600M GT
Compute capability: 1.1
Clock rate: 1250000
Device copy overlap : Enabled
Kernel Execution timeout : Enabled
Mem info for device 0
Total global mem: 536870912
Total const mem: 65536
Max mem pitch: 2147483647
Texture Alignment 256
MP info about device 0
Multiproccessor count: 4
Shared mem per mp: 16384
Registers per mp: 8192
Threads in warp: 32
Max threads per block: 512
Max thread dimensions: (512,512,64)
Max grid dimensions: (65535,65535,1)

Tags: cuda nvidia
3 Answers
啃猪蹄的小仙女
#2 · 2019-06-09 02:00

If N is:

#define N (33 * 1024) //value defined in CUDA by Example

I found the same code in CUDA by Example, but there N was defined as above. With this launch configuration, N can't be 33 * 1024; you must change the number of blocks and/or the number of threads per block. Because:

add<<<128,128>>>(dev_a,dev_b,dev_c); //16384 threads

(128 * 128) < (33 * 1024), so the elements past index 16383 are never computed and the check fails.
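
A minimal sketch of that fix, keeping one addition per thread and deriving the block count from N instead of hard-coding it (threadsPerBlock and blocks are names introduced here for illustration; any block size up to this device's 512-thread limit works):

int threadsPerBlock = 128;
// Round up so every element gets a thread; for N = 33 * 1024 this is
// 264 blocks, far below the 65535-block grid limit of this device.
int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;
add<<<blocks,threadsPerBlock>>>(dev_a,dev_b,dev_c);

The if(tid<N) guard in the kernel then correctly discards the few surplus threads in the last block.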

#3 · 2019-06-09 02:12

You aren't running out of shared memory; your vector arrays are copied into your device's global memory, which, as your specs show, has far more space (536870912 bytes) than the 196608 bytes (16384 * 4 * 3) you need.

The reason for your problem is that each thread performs only one addition, so with this structure the maximum length of your vectors is the product of the block and thread parameters in your kernel launch (128 * 128 = 16384), as tera has pointed out. By correcting

if(tid<N)

to

while(tid<N)

in your code, each thread will perform its addition on multiple indices, and the whole array will be covered (see the sketch below).
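
For reference, this is the question's kernel with that single change applied; the tid += blockDim.x * gridDim.x line already present then turns it into a standard grid-stride loop:

__global__ void add(int *a,int *b,int *c)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    while(tid<N)   // was: if(tid<N)
    {
        c[tid] = a[tid] + b[tid];
        tid += blockDim.x * gridDim.x;   // stride by the total number of launched threads
    }
}

With 128 * 128 = 16384 threads, each thread handles up to ceil(N / 16384) elements, so any N up to the limits of global memory works without changing the launch parameters.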

For more information about the memory hierarchy and the various places memory can reside, you should read sections 2.3 and 5.3 of the CUDA_C_Programming_Guide.pdf provided with the CUDA toolkit.

Hope that helps.

The star
#4 · 2019-06-09 02:20

You probably intended to write

while(tid<N)

not

if(tid<N)