When we check register usage by passing -Xptxas -v (i.e. --ptxas-options=-v) to nvcc, we see something like this:
ptxas info : Used 63 registers, 244 bytes cmem[0], 51220 bytes cmem[2], 24 bytes cmem[14], 20 bytes cmem[16]
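For reference, a minimal way to produce that output is to pass the flag through nvcc when compiling the kernel (the file name below is just a placeholder; -arch=sm_20 merely selects a Fermi-class target):

    nvcc -arch=sm_20 -Xptxas -v -c kernel.cu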
I wonder whether there is any documentation that clearly explains cmem[x]. What is the point of separating constant memory into multiple banks, how many banks are there in total, and what are the banks other than 0, 2, 14, and 16 used for?
As a side note, @njuffa (special thanks to you) previously explained on NVIDIA's forum what banks 0, 2, 14, and 16 are:
Used constant memory is partitioned into constant program 'variables' (bank 1), plus compiler-generated constants (bank 14).
cmem[0]: kernel arguments
cmem[2]: user-defined constant objects
cmem[16]: compiler-generated constants (some of which may correspond to literal constants in the source code)
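To make that mapping concrete, here is a small sketch of which source constructs would typically show up in which bank on the Fermi-class devices this breakdown refers to (the assignment is architecture-dependent, so treat the comments as illustrative rather than definitive):

    __constant__ float coeffs[256];   // user-defined constant object -> reported under cmem[2]

    __global__ void scale(float *out, const float *in, int n)   // kernel arguments -> cmem[0]
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            // the literal below may end up among the compiler-generated
            // constants (cmem[16]) or be encoded as an immediate, depending
            // on the architecture and the compiler
            out[i] = in[i] * coeffs[i & 255] + 0.5f;
    }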
Please see "Miscellaneous NVCC Usage". It mentions that constant bank allocation is profile-specific.
In the PTX guide, they say that apart from the 64 KB of constant memory, there are 10 additional banks of constant memory. The driver may allocate and initialize constant buffers in these regions and pass pointers to the buffers as kernel function parameters.
I guess the profile given to nvcc takes care of which constants go into which bank. In any case, we don't need to worry as long as each constant bank cmem[n] stays below 64 KB, because each bank is 64 KB in size and common to all threads in the grid.
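As an illustration of that per-bank limit, the user-visible constant memory can be declared right up to 64 KB; a minimal sketch (the helper name and its argument are hypothetical):

    #include <cuda_runtime.h>

    __constant__ float lut[16384];   // 16384 * 4 bytes = 65536 bytes = 64 KB, the full user-visible bank

    void uploadTable(const float *hostTable)
    {
        // copy host data into the constant bank before launching kernels that read lut
        cudaMemcpyToSymbol(lut, hostTable, sizeof(lut));
    }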
The usage of GPU constant banks by CUDA is not officially documented to my knowledge. The number and usage of constant banks does differ between GPU generations. These are low-level implementation details that programmers do not have to worry about.
The usage of constant banks can be reverse engineered, if so desired, by looking at the machine code (SASS) generated for a given platform. In fact, this is how I came up with the information cited in the original question (it came from an NVIDIA developer forum post of mine). As I recall, the information I gave there was based on ad-hoc reverse engineering applied specifically to Fermi-class devices, but I am unable to verify this at this time since the forums are inaccessible at the moment.
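If you want to try this yourself, the cuobjdump tool that ships with the CUDA toolkit can produce the SASS listing; constant memory references appear there in the form c[bank][offset], so the banks a kernel actually touches can be read off directly (the file name is a placeholder):

    cuobjdump --dump-sass kernel.cubin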
One reason for having multiple constant banks is to reserve the user-visible constant memory for use by CUDA programmers, while storing additional read-only information provided by the hardware or tools in the other constant banks.
Note that the CUDA math library is provided as source files, and its functions get inlined into user code; therefore, the constant memory usage of CUDA math library functions is included in the statistics for the user-visible constant memory.
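For example, one way to observe this is to compile two variants of a kernel, only one of which calls a math library function, and compare the cmem figures reported by -Xptxas -v (whether a given function actually pulls in constant data, and which bank it lands in, depends on the function and the architecture):

    __global__ void plain(double *x)   { x[threadIdx.x] = 2.0 * x[threadIdx.x]; }
    __global__ void useMath(double *x) { x[threadIdx.x] = erfc(x[threadIdx.x]); }

    // compile with: nvcc -arch=sm_20 -Xptxas -v -c compare.cu
    // then compare the per-kernel constant memory usage in the ptxas output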