How is a CUDA kernel launched?

Posted 2019-03-27 05:57

I have created a simple CUDA application to add two matrices. It is compiling fine. I want to know how the kernel will be launched by all the threads and what will the flow be inside CUDA? I mean, in what fashion every thread will execute each element of the matrices.

I know this is a very basic concept, but I don't know this. I am confused regarding the flow.

3 Answers
Luminary・发光体
#2 · 2019-03-27 06:22

You launch a grid of blocks.

Blocks are indivisibly assigned to multiprocessors (and the number of blocks resident on a multiprocessor determines how much shared memory is available to each of them).

Blocks are further split into warps. On a Fermi GPU a warp is 32 threads that either execute the same instruction or are inactive (because they branched away, e.g. by exiting a loop earlier than their neighbours within the same warp, or by not taking an if that the others did). A Fermi multiprocessor has two warp schedulers, so at most two warps are issued instructions at any one time.
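To make divergence concrete, here is a minimal sketch (the kernel name and data are hypothetical, not taken from the question): even and odd lanes of each warp take different branches, so the warp executes the two paths one after another with part of its lanes masked off.

  // Sketch of warp divergence (hypothetical kernel).
  __global__ void divergent(float *d)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i % 2 == 0)              // even lanes ...
          d[i] = d[i] * 2.0f;      // ... run this while the odd lanes sit idle
      else                         // odd lanes ...
          d[i] = d[i] + 1.0f;      // ... run this while the even lanes sit idle
  }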

Whenever there is latency (that is, execution stalls waiting for a memory access or a data dependency to complete), another warp is run. The number of warps that fit onto one multiprocessor - of the same or of different blocks - is determined by the number of registers each thread uses and by the amount of shared memory used by each block.

This scheduling happens transparently; you do not have to think about it too much. However, you will want to use the built-in variables threadIdx (where is my thread within its block?), blockDim (how large is one block?), blockIdx (where is my block in the grid?) and gridDim (how large is the grid?) to split up the work (read: input and output) among the threads. You might also want to read up on how to access the different types of memory efficiently (so that multiple threads can be serviced within a single transaction) - but that's leading off topic.
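As a minimal sketch of that index arithmetic (the kernel name and arguments are hypothetical), a common pattern combines the built-ins into one global index and uses a grid-stride loop so that the same launch works for any problem size:

  // Sketch: each thread derives its global index from the built-in variables
  // and strides over the data so that any n is covered by any launch size.
  __global__ void scale(float *data, int n, float factor)
  {
      int idx    = blockIdx.x * blockDim.x + threadIdx.x;  // my position in the grid
      int stride = gridDim.x * blockDim.x;                 // total number of threads
      for (int i = idx; i < n; i += stride)
          data[i] *= factor;
  }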

Nsight provides a graphical debugger that gives you a good idea of what is happening on the device once you get through the jargon jungle. The same goes for its profiler regarding the things you won't see in the debugger (e.g. stall reasons or memory pressure).

You can synchronize all threads within the grid (that is, all threads there are) only by another kernel launch: kernels issued to the same stream execute sequentially without overlapping, so no further synchronization is needed between them.
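A small sketch of this, with hypothetical kernels step_one and step_two: launches issued to the same (default) stream execute back to back, so the second launch effectively acts as a grid-wide barrier for the first.

  // Sketch: two hypothetical kernels launched into the same stream.
  __global__ void step_one(float *d, int n)
  { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) d[i] += 1.0f; }

  __global__ void step_two(float *d, int n)
  { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) d[i] *= 2.0f; }

  // In host code:
  //   step_one<<<blocks, threads>>>(d_data, n);
  //   step_two<<<blocks, threads>>>(d_data, n);   // only starts after every thread of step_one has finished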

The threads within one grid (or one kernel run, whichever you want to call it) can communicate via global memory, using atomic operations (for arithmetic) or appropriate memory fences (for load or store access).

You can synchronize all threads within one block with the intrinsic instruction __syncthreads(). All threads of the block will be active afterwards (although, as always, at most two warps are issued instructions at a time on a Fermi multiprocessor). The threads within one block can communicate via shared or global memory, using atomic operations (for arithmetic) or appropriate memory fences (for load or store access).
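A minimal sketch of that pattern (hypothetical kernel, assuming a block size of 256): data is staged in shared memory, and __syncthreads() guarantees that every thread has finished writing before any thread reads what a neighbour wrote.

  // Sketch: reverse the elements handled by one block, via shared memory.
  __global__ void reverse_block(float *d)
  {
      __shared__ float buf[256];                    // assumes blockDim.x == 256
      int t = threadIdx.x;
      buf[t] = d[blockIdx.x * blockDim.x + t];      // stage my element
      __syncthreads();                              // block-wide barrier
      d[blockIdx.x * blockDim.x + t] = buf[blockDim.x - 1 - t];
  }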

As mentioned earlier, all threads within a warp are always "synchronized", although some might be inactive. They can communicate through shared or global memory (or via the warp shuffle instructions on hardware of compute capability 3.0 and later). You can use atomic operations (for arithmetic) and volatile-qualified shared or global variables (loads and stores happen sequentially within the same warp). The volatile qualifier tells the compiler to always access memory and never cache values in registers, whose state cannot be seen by other threads.

Further, there are warp-wide vote functions (__all(), __any(), __ballot()) that can help you make branch decisions or compute integer (prefix) sums.
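A small sketch of a warp vote (hypothetical kernel; current CUDA versions require the *_sync variants of these intrinsics): every lane contributes one predicate bit, __ballot_sync collects the bits into a 32-bit mask, and __popc counts how many lanes voted yes.

  // Sketch: count, per warp, how many elements are positive.
  // Assumes the array length is a multiple of the block size, so every lane has valid data.
  __global__ void count_positive(const float *d, int *warp_counts)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      unsigned mask = __ballot_sync(0xffffffffu, d[i] > 0.0f);  // one bit per lane
      if ((threadIdx.x & 31) == 0)                              // lane 0 of each warp
          warp_counts[i / 32] = __popc(mask);                   // population count of the vote
  }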

OK, that's basically it. Hope that helps. Had a good flow writing :-).

Anthone
#3 · 2019-03-27 06:24

Let's take the example of adding two 4*4 matrices: you have two matrices A and B, each of dimension 4*4.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void add_matrices(float *ad, float *bd, float *cd, int N);   // defined below

int main()
{
    float *a, *b, *c;          // matrices A and B in host RAM; the result will be stored in matrix C
    float *ad, *bd, *cd;       // the same matrices in the GPU's RAM
    int N = 4;                 // number of rows and columns
    int i, j;

    size_t size = sizeof(float) * N * N;

    a = (float*)malloc(size);  // allocate host space for matrix A
    b = (float*)malloc(size);  // allocate host space for matrix B
    c = (float*)malloc(size);  // allocate host space for the result matrix C

    //allocate memory on the device
    cudaMalloc(&ad, size);
    cudaMalloc(&bd, size);
    cudaMalloc(&cd, size);

    //initialize host memory with its own indices
    for (i = 0; i < N; i++)
    {
        for (j = 0; j < N; j++)
        {
            a[i * N + j] =  (float)(i * N + j);
            b[i * N + j] = -(float)(i * N + j);
        }
    }

    //copy data from host memory to device memory
    cudaMemcpy(ad, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(bd, b, size, cudaMemcpyHostToDevice);

    //calculate the execution configuration
    dim3 grid (1, 1, 1);
    dim3 block(16, 1, 1);

    //the single block contains N * N = 16 threads; each thread computes one element

    add_matrices<<<grid, block>>>(ad, bd, cd, N);

    cudaMemcpy(c, cd, size, cudaMemcpyDeviceToHost);

    printf("Matrix A was---\n");
    for (i = 0; i < N; i++)
    {
        for (j = 0; j < N; j++)
            printf("%f ", a[i * N + j]);
        printf("\n");
    }

    printf("\nMatrix B was---\n");
    for (i = 0; i < N; i++)
    {
        for (j = 0; j < N; j++)
            printf("%f ", b[i * N + j]);
        printf("\n");
    }

    printf("\nAddition of A and B gives C----\n");
    for (i = 0; i < N; i++)
    {
        for (j = 0; j < N; j++)
            printf("%f ", c[i * N + j]);   //if correctly evaluated, all values will be 0
        printf("\n");
    }

    //deallocate device and host memories
    cudaFree(ad);
    cudaFree(bd);
    cudaFree(cd);

    free(a);
    free(b);
    free(c);

    return 0;
}

/////Kernel Part

__global__ void add_matrices(float *ad, float *bd, float *cd, int N)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per matrix element

    cd[index] = ad[index] + bd[index];
}

Now let's walk through that example of adding 4*4 matrices: two matrices A and B, 16 elements each.

First of all you have to decide your thread configuration. You launch a kernel function, which performs the parallel part of the matrix addition and which gets executed on your GPU device.

Now, one grid is launched per kernel launch. A grid can have at most 65,535 blocks per dimension (on hardware of that generation), and the blocks can be arranged three-dimensionally (65535 * 65535 * 65535).

Every block in the grid can have at most 1024 threads in total, and those threads can also be arranged three-dimensionally, with per-dimension limits of (1024 * 1024 * 64). You can query the actual limits of your own device, as sketched below.
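Rather than hard-coding these numbers, a small sketch using the CUDA runtime's cudaGetDeviceProperties reads the limits at run time:

  #include <stdio.h>
  #include <cuda_runtime.h>

  int main()
  {
      cudaDeviceProp prop;
      cudaGetDeviceProperties(&prop, 0);            // device 0
      printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
      printf("max block dims: %d x %d x %d\n",
             prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
      printf("max grid dims:  %d x %d x %d\n",
             prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
      return 0;
  }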

Now our problem is the addition of two 4 * 4 matrices:

A | 1  2  3  4 |        B | 1  2  3  4 |      C| 1  2  3  4 |
  | 5  6  7  8 |   +      | 5  6  7  8 |   =   | 5  6  7  8 | 
  | 9 10 11 12 |          | 9 10 11 12 |       | 9 10 11 12 |  
  | 13 14 15 16|          | 13 14 15 16|       | 13 14 15 16|

We need 16 threads to perform the computation.

i.e. A(1,1) + B (1,1) = C(1,1)
     A(1,2) + B (1,2) = C(1,2) 
     .        .          .
     .        .          . 
     A(4,4) + B (4,4) = C(4,4) 

All of these threads can execute simultaneously, so we need one block with 16 threads. For convenience we will arrange the threads as (16 * 1 * 1) within the block; since there are only 16 threads, a single block is enough to hold them.

So the grid configuration will be dim3 grid(1, 1, 1), i.e. the grid has only one block, and the block configuration will be dim3 block(16, 1, 1), i.e. the block has its 16 threads arranged along a single dimension.
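For larger matrices the same one-thread-per-element idea simply scales out to more blocks. A sketch (this 2-D kernel and its launch are an illustration, not part of the original program):

  // Sketch: 2-D grid of 2-D blocks, one thread per matrix element, with a
  // bounds check for the case where N is not a multiple of the block size.
  __global__ void add_matrices_2d(const float *ad, const float *bd, float *cd, int N)
  {
      int col = blockIdx.x * blockDim.x + threadIdx.x;
      int row = blockIdx.y * blockDim.y + threadIdx.y;
      if (row < N && col < N)
          cd[row * N + col] = ad[row * N + col] + bd[row * N + col];
  }

  // In host code:
  //   dim3 block(16, 16, 1);
  //   dim3 grid((N + 15) / 16, (N + 15) / 16, 1);
  //   add_matrices_2d<<<grid, block>>>(ad, bd, cd, N);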

The program above should give you a clear idea of its execution. Understanding the indexing part (i.e. threadIdx, blockDim, blockIdx) is the important bit; you need to go through the CUDA literature for that. Once you have a clear idea about indexing, you will have won half the battle! So spend some time with CUDA books, different algorithms and, of course, paper and pencil.

够拽才男人
#4 · 2019-03-27 06:41

Try cuda-gdb, which is the CUDA debugger.
