This is a question about how to determine the CUDA grid, block and thread sizes. This is an additional question to the one posted here:
https://stackoverflow.com/a/5643838/1292251
Following this link, the answer from talonmies contains a code-snippet (see below). I don't understand the comment "value usually chosen by tuning and hardware constraints".
I haven't found a good explanation or clarification that explains this in the CUDA documentation. In summary, my question is how to determine the optimal blocksize (=number of threads) given the following code:
const int n = 128 * 1024;
int blocksize = 512; // value usually chosen by tuning and hardware constraints
int nblocks = n / nthreads; // value determine by block size and total work
madd<<<nblocks,blocksize>>>mAdd(A,B,C,n);
BTW, I started my question with the link above because it partly answers my first question. If this is not a proper way to ask questions on Stack Overflow, please excuse or advise me.
The blocksize is usually selected to maximize the "occupancy". Search on CUDA Occupancy for more information. In particular, see the CUDA Occupancy Calculator spreadsheet.
There are two parts to that answer (I wrote it). One part is easy to quantify, the other is more empirical.
Hardware Constraints:
This is the easy to quantify part. Appendix F of the current CUDA programming guide lists a number of hard limits which limit how many threads per block a kernel launch can have. If you exceed any of these, your kernel will never run. They can be roughly summarized as:
If you stay within those limits, any kernel you can successfully compile will launch without error.
Performance Tuning:
This is the empirical part. The number of threads per block you choose within the hardware constraints outlined above can and does effect the performance of code running on the hardware. How each code behaves will be different and the only real way to quantify it is by careful benchmarking and profiling. But again, very roughly summarized:
The second point is a huge topic which I doubt anyone is going to try and cover it in a single StackOverflow answer. There are people writing PhD theses around the quantitative analysis of aspects of the problem (see this presentation by Vasily Volkov from UC Berkley and this paper by Henry Wong from the University of Toronto for examples of how complex the question really is).
At the entry level, you should mostly be aware that the block size you choose (within the range of legal block sizes defined by the constraints above) can and does have a impact on how fast your code will run, but it depends on the hardware you have and the code you are running. By benchmarking, you will probably find that most non-trivial code has a "sweet spot" in the 128-512 threads per block range, but it will require some analysis on your part to find where that is. The good news is that because you are working in multiples of the warp size, the search space is very finite and the best configuration for a given piece of code relatively easy to find.
The answers above point out how the block size can impact performance and suggest a common heuristic for its choice based on occupancy maximization. Without wanting to provide the criterion to choose the block size, it would be worth mentioning that CUDA 6.5 (now in Release Candidate version) includes several new runtime functions to aid in occupancy calculations and launch configuration, see
CUDA Pro Tip: Occupancy API Simplifies Launch Configuration
One of the useful functions is
cudaOccupancyMaxPotentialBlockSize
which heuristically calculates a block size that achieves the maximum occupancy. The values provided by that function could be then used as the starting point of a manual optimization of the launch parameters. Below is a little example.EDIT
The
cudaOccupancyMaxPotentialBlockSize
is defined in thecuda_runtime.h
file and is defined as follows:The meanings for the parameters is the following
Note that, as of CUDA 6.5, one needs to compute one's own 2D/3D block dimensions from the 1D block size suggested by the API.
Note also that the CUDA driver API contains functionally equivalent APIs for occupancy calculation, so it is possible to use
cuOccupancyMaxPotentialBlockSize
in driver API code in the same way shown for the runtime API in the example above.