Timing Kernel launches in CUDA while using Thrust

Kernel launches in CUDA are generally asynchronous, which (as I understand it) means that control returns to the CPU as soon as the kernel has been launched. The CPU can then keep doing useful work while the GPU is busy number crunching, unless the CPU is explicitly made to wait with cudaThreadSynchronize() or a blocking call such as cudaMemcpy().

Now I have just started using the Thrust library for CUDA. Are the function calls in Thrust synchronous or asynchronous?

In other words, if I invoke thrust::sort(D.begin(),D.end()); where D is a device vector, does it make sense to measure the sorting time using

    start = clock(); // start timing

    thrust::sort(D.begin(), D.end());

    diff = (clock() - start) / (double)CLOCKS_PER_SEC;
    std::cout << "\nDevice Time taken is: " << diff << std::endl;

If the function call is asynchronous, diff will be close to 0 seconds for a vector of any size (which is useless as a timing), but if it is synchronous I will get the actual sorting time.

Thrust calls that launch kernels are asynchronous, just like the underlying CUDA APIs that Thrust uses. Thrust calls that copy data are synchronous, again just like the underlying CUDA APIs.

So your example would only be measuring the kernel launch and Thrust host-side setup overheads, not the operation itself. For timing, you can get around this by calling either cudaThreadSynchronize or cudaDeviceSynchronize after the Thrust call. The latter is available from CUDA 4.0 onwards, where it replaces the now-deprecated cudaThreadSynchronize. Alternatively, if you follow the Thrust call with a copy operation and record the stop time after that, your timing will include setup, execution, and copying time.

In your example this would look something like

    start = clock(); // start timing

    thrust::sort(D.begin(), D.end());
    cudaDeviceSynchronize(); // block until the kernel has finished (cudaThreadSynchronize() before CUDA 4.0)

    diff = (clock() - start) / (double)CLOCKS_PER_SEC;
    std::cout << "\nDevice Time taken is: " << diff << std::endl;
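As an alternative to a host clock, CUDA events can time the work on the GPU itself. The following is a minimal sketch, assuming the Thrust call runs on the default stream and that the file is compiled with nvcc. The helper name time_sort_ms and the input size are made up for the illustration, and error checking is omitted for brevity.

```cpp
#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sort.h>

#include <cstddef>
#include <cstdlib>
#include <iostream>

// Hypothetical helper: sorts `d` in place and returns the elapsed GPU time in
// milliseconds, measured between two events recorded on the default stream.
float time_sort_ms(thrust::device_vector<int>& d) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);              // marker enqueued before the sort
    thrust::sort(d.begin(), d.end());
    cudaEventRecord(stop);               // marker enqueued after the sort
    cudaEventSynchronize(stop);          // host waits until `stop` has been reached

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    // Assumed test input: about one million pseudo-random integers.
    thrust::host_vector<int> h(1 << 20);
    for (std::size_t i = 0; i < h.size(); ++i) {
        h[i] = std::rand();
    }
    thrust::device_vector<int> D = h;    // copy to the device before timing

    float ms = time_sort_ms(D);
    std::cout << "\nDevice Time taken is: " << ms << " ms" << std::endl;
    return 0;
}
```

Because the events are recorded in the same stream as the sort, the elapsed time covers the device work between them rather than host-side overhead before the first marker. A common practice is to run the operation once untimed first, so that one-time initialization costs do not end up in the measurement.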