Which is the best way of using constants in CUDA?
One way is to define constants in constant memory, like:
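// CUDA global constants
__constant__ int M;

int main(void)
{
    ...
    int h_M = ...;  // host-side copy of the value
    // Pass the symbol itself; the string form "M" was removed in CUDA 5.0.
    cudaMemcpyToSymbol(M, &h_M, sizeof(M));
    ...
}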
An alternative way would be to use the C preprocessor:
#define M ...
I would think defining constants with the C preprocessor is much faster. What, then, are the benefits of using constant memory on a CUDA device?
Regular C/C++ style constants: In CUDA C (itself an extension of C/C++), constants are strictly compile-time entities. This is hardly surprising, given how much optimization NVCC performs and the nature of GPU processing.
#define: macros are, as always, very inelegant but useful in a pinch.
The __constant__ variable specifier is, however, a completely new animal and something of a misnomer in my opinion. Here is what Nvidia has to say about it: according to Nvidia's documentation, a __constant__ variable can be read at register-level speed (near-zero latency), provided all threads of a warp access the same constant. Such variables are declared at global scope in CUDA code. HOWEVER, based on personal (and currently ongoing) experience, you have to be careful with this specifier when it comes to separate compilation, such as separating your CUDA code (.cu and .cuh files) from your C/C++ code by putting wrapper functions in C-style headers.
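To make the "same constant for all threads of a warp" condition concrete, here is a minimal sketch, not anything taken from Nvidia's documentation. The names d_coeffs, MAX_COEFFS, eval_poly and eval_poly_on_device are made up for illustration. Each thread evaluates a polynomial at its own x, but at every loop step all threads read the same coefficient, which is the access pattern described above.

```cuda
#include <cuda_runtime.h>

#define MAX_COEFFS 8  // hypothetical upper bound on the number of coefficients

// Polynomial coefficients in constant memory, declared at file scope.
__constant__ float d_coeffs[MAX_COEFFS];

// Each thread evaluates the polynomial at its own x with Horner's rule.
// In every loop iteration all threads read the same d_coeffs[k], so the
// read is uniform across the warp.
__global__ void eval_poly(const float *x, float *y, int n, int num_coeffs)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float acc = 0.0f;
    for (int k = num_coeffs - 1; k >= 0; --k)
        acc = acc * x[i] + d_coeffs[k];
    y[i] = acc;
}

// Hypothetical host helper: returns 0 on success, 1 on any error.
int eval_poly_on_device(const float *h_coeffs, int num_coeffs,
                        const float *h_x, float *h_y, int n)
{
    if (num_coeffs < 1 || num_coeffs > MAX_COEFFS || n < 1) return 1;

    float *d_x = nullptr, *d_y = nullptr;
    size_t bytes = n * sizeof(float);

    // Allocate buffers, fill constant memory, and upload the inputs.
    bool ok = cudaMalloc(&d_x, bytes) == cudaSuccess
           && cudaMalloc(&d_y, bytes) == cudaSuccess
           && cudaMemcpyToSymbol(d_coeffs, h_coeffs,
                                 num_coeffs * sizeof(float)) == cudaSuccess
           && cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice) == cudaSuccess;

    if (ok) {
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        eval_poly<<<blocks, threads>>>(d_x, d_y, n, num_coeffs);
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(h_y, d_y, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    cudaFree(d_x);
    cudaFree(d_y);
    return ok ? 0 : 1;
}
```

If the index into d_coeffs depended on threadIdx.x instead, threads of one warp would read different addresses, and the benefit described above should not be expected.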
Unlike traditional "constant" variables, however, these are initialized at runtime from the host code that allocates device memory and ultimately launches the kernel. I repeat: I am currently working on code that demonstrates they can be set at runtime using cudaMemcpyToSymbol() before kernel execution.
They are quite handy, to say the least, given the cache-level access speed they offer when every thread of a warp reads the same value.
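The runtime initialization described above can be sketched roughly as follows. This is only an illustration under assumptions: set_M, scale_by_M and scale_on_device are hypothetical names, and the sketch keeps the __constant__ symbol and the wrapper that names it in the same .cu file, which is one way to sidestep the separate-compilation trouble mentioned earlier. A C-style header would then only need to declare the two extern "C" functions.

```cuda
#include <cuda_runtime.h>

// The device symbol lives in this .cu file, next to the code that names it.
__constant__ int M;

// Hypothetical wrapper: plain C/C++ code calls this instead of touching
// the device symbol directly. Returns 0 on success, 1 on error.
extern "C" int set_M(int value)
{
    return cudaMemcpyToSymbol(M, &value, sizeof(M)) == cudaSuccess ? 0 : 1;
}

__global__ void scale_by_M(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= M;  // every thread reads the same constant
}

// Hypothetical wrapper: multiplies h_data in place by the current M.
// Returns 0 on success, 1 on error.
extern "C" int scale_on_device(int *h_data, int n)
{
    if (n < 1) return 1;

    int *d_data = nullptr;
    size_t bytes = n * sizeof(int);

    bool ok = cudaMalloc(&d_data, bytes) == cudaSuccess
           && cudaMemcpy(d_data, h_data, bytes,
                         cudaMemcpyHostToDevice) == cudaSuccess;

    if (ok) {
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        scale_by_M<<<blocks, threads>>>(d_data, n);
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(h_data, d_data, bytes,
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    cudaFree(d_data);
    return ok ? 0 : 1;
}
```

Calling set_M(3) before one launch and set_M(5) before the next changes the value the kernel sees without recompiling, which is why "constant" here means constant for the duration of a kernel rather than constant at compile time.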
1. Constants that are known at compile time can be defined using preprocessor macros (e.g. #define) or via C/C++ const variables at global/file scope.
2. __constant__ memory may be beneficial for programs that use certain values that don't change for the duration of the kernel and for which certain access patterns are present (e.g. all threads access the same value at the same time). This is not better or faster than constants that satisfy the requirements of item 1 above.
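As a rough side-by-side of the two kinds of constants, here is a small sketch. The names BLOCK_SIZE, STRIDE, d_offset, fill and fill_on_device are invented for the example, and the values are arbitrary. BLOCK_SIZE and STRIDE are known at compile time, so the compiler can fold them into the generated code, while d_offset is only known at run time and therefore goes through __constant__ memory.

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE 128      // preprocessor constant (arbitrary value)
const int STRIDE = 4;       // file-scope const, known at compile time
__constant__ int d_offset;  // value is only known at run time

__global__ void fill(int *out, int n)
{
    // BLOCK_SIZE stands in for blockDim.x because the launch below
    // always uses BLOCK_SIZE threads per block.
    int i = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    if (i < n) out[i] = i * STRIDE + d_offset;
}

// Hypothetical host helper: returns 0 on success, 1 on any error.
int fill_on_device(int offset, int *h_out, int n)
{
    if (n < 1) return 1;

    int *d_out = nullptr;
    size_t bytes = n * sizeof(int);

    bool ok = cudaMalloc(&d_out, bytes) == cudaSuccess
           && cudaMemcpyToSymbol(d_offset, &offset,
                                 sizeof(d_offset)) == cudaSuccess;

    if (ok) {
        int blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
        fill<<<blocks, BLOCK_SIZE>>>(d_out, n);
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(h_out, d_out, bytes,
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    cudaFree(d_out);
    return ok ? 0 : 1;
}
```

If the offset were also fixed at compile time, making it a #define or a const like STRIDE would be the simpler choice, in line with item 1.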