I have some functions that load a variable into constant device memory and then launch a kernel. I noticed that the first time one of these functions loads a variable into constant memory it takes 0.6 seconds, but subsequent loads into constant memory are very fast (0.0008 seconds). This behaviour occurs regardless of which function is called first in main. Below is an example:
__constant__ double res1;
__global__ void kernel1(...) { ... }

void function1() {
    double resHost = 255 / ((double) size);
    CUDA_CHECK_RETURN(cudaMemcpyToSymbol(res1, &resHost, sizeof(double)));
    // prepare and launch kernel
}

__constant__ double res2;
__global__ void kernel2(...) { ... }

void function2() {
    double resHost = 255 / ((double) size);
    CUDA_CHECK_RETURN(cudaMemcpyToSymbol(res2, &resHost, sizeof(double)));
    // prepare and launch kernel
}
int main() {
    function1(); // takes 0.6 seconds for the load
    function2(); // takes 0.0008 seconds for the load
    function1(); // takes 0.0008 seconds for the load
    return 0;
}
Why is this happening? Can I avoid it?
Lazy runtime API context establishment and setup.
No. The first runtime API call to require a context will incur significant setup latency; in your case that is the first cudaMemcpyToSymbol call.
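
You cannot remove that cost, but you can choose where it lands. A minimal sketch (assuming a single-GPU system and device 0) that issues a throw-away runtime call at the start of main, so the lazy context creation happens there instead of inside the first timed cudaMemcpyToSymbol:

#include <cuda_runtime.h>
#include <cstdio>

__constant__ double res1;   // same kind of symbol as in the question

int main() {
    // Warm-up: any runtime call that needs a context forces the lazy
    // context creation here. cudaFree(0) is a common idiom for this;
    // cudaSetDevice(0) assumes device 0 is the GPU you want.
    cudaSetDevice(0);
    cudaFree(0);

    // Example value standing in for the question's computation.
    double resHost = 255.0 / 1024.0;

    // This copy no longer absorbs the context-creation latency.
    cudaError_t err = cudaMemcpyToSymbol(res1, &resHost, sizeof(double));
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemcpyToSymbol failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}

The total startup time is the same; the warm-up only keeps the one-off setup out of whatever you are timing afterwards.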