Kernel launches in CUDA are generally asynchronous, which (as I understand it) means that once a kernel is launched, control returns immediately to the CPU. The CPU can then continue doing useful work while the GPU is busy number crunching, unless the CPU is explicitly stalled using cudaThreadSynchronize() or a blocking call such as cudaMemcpy().
Now I have just started using the Thrust library for CUDA. Are the function calls in Thrust synchronous or asynchronous?
In other words, if I invoke thrust::sort(D.begin(), D.end()); where D is a device vector, does it make sense to measure the sorting time using:
clock_t start = clock(); // Start
thrust::sort(D.begin(), D.end());
double diff = ( clock() - start ) / (double)CLOCKS_PER_SEC;
std::cout << "\nDevice Time taken is: " << diff << std::endl;
If the function call is asynchronous, then diff will be close to 0 seconds regardless of the vector size (which is useless as a timing), whereas if it is synchronous I will get the actual time spent sorting.
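Whichever way the call behaves, the measurement can be made independent of it by bracketing the sort with CUDA events and waiting on the stop event before reading the elapsed time. Below is a minimal sketch of that idea, not a definitive benchmark. The helper name time_sort_ms, the element type int, and the vector size are my own assumptions for illustration, and the sketch assumes thrust::sort issues its work on the default stream, which is where the events are recorded. Error checking of the CUDA calls is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sort.h>
#include <cstdlib>
#include <iostream>

// Hypothetical helper (not part of Thrust): returns the GPU time in
// milliseconds taken by one thrust::sort call on the given device vector.
float time_sort_ms(thrust::device_vector<int>& d)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);             // marker enqueued before the sort (default stream)
    thrust::sort(d.begin(), d.end());
    cudaEventRecord(stop);              // marker enqueued after the sort's GPU work
    cudaEventSynchronize(stop);         // host waits here until the stop marker is reached

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const int n = 1 << 20;              // assumed problem size, for illustration only

    // Fill a host vector with pseudo-random values and copy it to the device.
    thrust::host_vector<int> h(n);
    for (int i = 0; i < n; ++i) h[i] = std::rand();
    thrust::device_vector<int> D = h;

    std::cout << "Device time taken: " << time_sort_ms(D) << " ms" << std::endl;
    return 0;
}
```

If a host-side timer such as clock() is preferred instead, calling cudaDeviceSynchronize() (the replacement for the deprecated cudaThreadSynchronize()) immediately after the sort and before reading the timer should give the same guarantee that all GPU work has finished.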