dim3 block(4, 2)
dim3 grid((nx+block.x-1)/block.x, (ny.block.y-1)/block.y);
I found this code in Professional CUDA C Programming on page 53. It's meant to be a naive example of matrix multiplication. nx
is the number of columns and ny
is the number of rows.
Can you explain how the grid size is computed? Why is block.x
added to nx
and then subtracted by 1
?
There is a preview (https://books.google.com/books?id=_Z7rnAEACAAJ&printsec=frontcover#v=onepage&q&f=false) but page 53 is missing.
This is the standard CUDA idiom for determining the minimum number of blocks in each dimension (the "grid") that completely cover the desired input. This could be expressed as ceil(nx/block.x)
, that is, figure out how many blocks are needed to cover the desired size, then round up.
But full floating point division and ceil is more expensive than necessary. Instead, since C defines integer division as a "floor" operation, you can add the divisor - 1 before dividing to the get the effect of a "ceiling" operation.
Try a few examples: If nx = 10
, then nx + block.x - 1
is 13, and by integer divison, you need 3 blocks of size 4.
As you noted in the comment, +block.x pushes up floor to ceiling and the -1 is for numbers that divide perfectly into the divisor. e.g. (12 + 4)/4 would be 4 when we actually want (12+4-1)/4 which 3