Closed 2 years ago.
How are threads organized to be executed by a GPU?
If a GPU device has, for example, 4 multiprocessors, and each can run 768 threads, then at any given moment no more than 4*768 = 3072 threads are actually running in parallel (if you launch more threads, the rest wait their turn).
Threads are organized in blocks. A block is executed by one multiprocessor. The threads of a block can be identified (indexed) using 1D (x), 2D (x,y) or 3D (x,y,z) indexes, but in any case x*y*z <= 768 for our example (other restrictions apply to x, y and z; see the guide and your device's compute capability).
Obviously, if you need more than those 4*768 threads, you need more than 4 blocks. Blocks may also be indexed in 1D, 2D or 3D. There is a queue of blocks waiting to enter the GPU (because, in our example, the GPU has 4 multiprocessors and only 4 blocks are executed simultaneously).
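The arithmetic above can be sketched in host-side C++ (which also compiles under nvcc). This is only an illustration: the struct and function names are hypothetical, and the values 4 and 768 are the example figures from this answer, not properties queried from a real device.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical description of the example device; on real hardware these
// values would come from the device properties, not from constants.
struct ExampleDevice {
    std::uint32_t multiprocessors;         // 4 in the example above
    std::uint32_t maxThreadsPerMultiproc;  // 768 in the example above
};

// Upper bound on the number of threads the device can hold at once.
std::uint32_t maxResidentThreads(const ExampleDevice& dev) {
    return dev.multiprocessors * dev.maxThreadsPerMultiproc;
}

// Number of blocks needed so that at least totalThreads threads exist
// (integer division rounded up).
std::uint32_t blocksNeeded(std::uint32_t totalThreads, std::uint32_t threadsPerBlock) {
    assert(threadsPerBlock > 0);
    return (totalThreads + threadsPerBlock - 1) / threadsPerBlock;
}
```

With `ExampleDevice{4, 768}`, `maxResidentThreads` gives 3072, and `blocksNeeded(10000, 768)` gives 14, so under this answer's one-block-per-multiprocessor model ten of those blocks would be waiting in the queue at the start.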
Suppose we want one thread to process one pixel (i,j) of a 512x512 image.
We can use blocks of 64 threads each. Then we need 512*512/64 = 4096 blocks (so that we have 512x512 threads = 4096*64).
It's common to organize (to make indexing the image easier) the threads in 2D blocks having blockDim = 8 x 8 (the 64 threads per block). I prefer to call it threadsPerBlock.
dim3 threadsPerBlock(8, 8); // 64 threads
and 2D gridDim = 64 x 64 blocks (the 4096 blocks needed). I prefer to call it numBlocks.
dim3 numBlocks(imageWidth/threadsPerBlock.x, /* for instance 512/8 = 64*/
imageHeight/threadsPerBlock.y);
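The integer divisions above are exact only when the image dimensions are multiples of the block dimensions. A common variation, sketched here in host-side C++, rounds the block count up instead. `Dim2` and `gridFor` are hypothetical names; `Dim2` is a stand-in for `dim3` so the snippet compiles without the CUDA headers.

```cpp
#include <cassert>

// Hypothetical stand-in for CUDA's dim3, restricted to two dimensions.
struct Dim2 {
    unsigned int x;
    unsigned int y;
};

// Smallest grid whose blocks cover an imageWidth x imageHeight image.
Dim2 gridFor(unsigned int imageWidth, unsigned int imageHeight, Dim2 threadsPerBlock) {
    assert(threadsPerBlock.x > 0 && threadsPerBlock.y > 0);
    Dim2 numBlocks;
    // Round up so that a partially filled last block is still launched.
    numBlocks.x = (imageWidth + threadsPerBlock.x - 1) / threadsPerBlock.x;
    numBlocks.y = (imageHeight + threadsPerBlock.y - 1) / threadsPerBlock.y;
    return numBlocks;
}
```

For a 512x512 image and 8x8 blocks this gives the same 64x64 grid as the plain division. For a 500x500 image it gives 63x63, in which case some threads fall outside the image and the kernel would need a bounds check before touching a pixel.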
The kernel is launched like this:
myKernel <<<numBlocks,threadsPerBlock>>>( /* params for the kernel function */ );
Finally, there will be something like "a queue of 4096 blocks", where each block waits to be assigned to one of the GPU's multiprocessors to get its 64 threads executed.
In the kernel the pixel (i,j) to be processed by a thread is calculated this way:
uint i = (blockIdx.x * blockDim.x) + threadIdx.x;
uint j = (blockIdx.y * blockDim.y) + threadIdx.y;
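To check that this index formula assigns every pixel to exactly one thread, the launch can be emulated sequentially on the host. This is a minimal sketch in plain C++, assuming a square image whose side is a multiple of the block side; `globalIndex` and `coversImageExactlyOnce` are hypothetical helper names, and the nested loops only imitate what the GPU does in parallel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Pixel coordinate along one axis, mirroring the kernel expression
// blockIdx * blockDim + threadIdx.
unsigned int globalIndex(unsigned int blockIdx, unsigned int blockDim, unsigned int threadIdx) {
    assert(threadIdx < blockDim);
    return blockIdx * blockDim + threadIdx;
}

// Visits every (block, thread) pair of a 2D launch one after another and
// counts how often each pixel is touched. Returns true if every pixel of
// the imageSide x imageSide image is visited exactly once.
bool coversImageExactlyOnce(unsigned int imageSide, unsigned int blockSide) {
    assert(blockSide > 0 && imageSide % blockSide == 0);
    const unsigned int blocksPerSide = imageSide / blockSide;
    std::vector<unsigned int> hits(static_cast<std::size_t>(imageSide) * imageSide, 0);
    for (unsigned int by = 0; by < blocksPerSide; ++by) {
        for (unsigned int bx = 0; bx < blocksPerSide; ++bx) {
            for (unsigned int ty = 0; ty < blockSide; ++ty) {
                for (unsigned int tx = 0; tx < blockSide; ++tx) {
                    const unsigned int i = globalIndex(bx, blockSide, tx);
                    const unsigned int j = globalIndex(by, blockSide, ty);
                    ++hits[static_cast<std::size_t>(j) * imageSide + i];
                }
            }
        }
    }
    for (unsigned int h : hits) {
        if (h != 1) return false;
    }
    return true;
}
```

For the example in this answer, `coversImageExactlyOnce(512, 8)` holds, and the last thread of the last block along an axis gets `globalIndex(63, 8, 7)`, which is 511.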
Suppose a 9800GT GPU: it has 14 multiprocessors, each with 8 thread processors, and the warp size is 32, which means each thread processor handles up to 32 threads. So 14*8*32 = 3584 is the maximum number of actually concurrent threads.
If you execute this kernel with more than 3584 threads (say 4000 threads; it does not matter how you define the block and grid, the GPU will treat them the same):
func1();
__syncthreads();
func2();
__syncthreads();
then the order of execution of those two functions is as follows:
1. func1 is executed for the first 3584 threads
2. func2 is executed for the first 3584 threads
3. func1 is executed for the remaining threads
4. func2 is executed for the remaining threads
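Under this answer's simplified model (at most 3584 threads run at once and the rest wait), the batching can be sketched in host-side C++. The function name and the flat thread count are hypothetical simplifications: a real GPU schedules whole blocks and warps, and `__syncthreads()` only synchronizes the threads of one block, so the exact ordering on hardware may differ.

```cpp
#include <cassert>
#include <vector>

// Splits totalThreads into consecutive batches of at most maxConcurrent
// threads, following the simplified model described above.
std::vector<unsigned int> threadBatches(unsigned int totalThreads, unsigned int maxConcurrent) {
    assert(maxConcurrent > 0);
    std::vector<unsigned int> batches;
    unsigned int remaining = totalThreads;
    while (remaining > 0) {
        const unsigned int batch = remaining < maxConcurrent ? remaining : maxConcurrent;
        batches.push_back(batch);
        remaining -= batch;
    }
    return batches;
}
```

For example, `threadBatches(4000, 14 * 8 * 32)` yields two batches, 3584 and 416, which correspond to "the first 3584 threads" and "the remaining threads" in the list above.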