When timing CUDA kernels, the following doesn't work because kernel launches are asynchronous and do not block CPU program execution while the kernel runs:
start timer
kernel<<<g,b>>>();
end timer
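For concreteness, here is a minimal sketch of the broken version (dummy_kernel and the std::chrono timer are placeholders of my own, purely for illustration); the measured interval covers only the launch overhead, because control returns to the CPU as soon as the kernel is queued:

#include <cstdio>
#include <chrono>

__global__ void dummy_kernel(){
    for (volatile int i = 0; i < 1000000; ++i) {} //busy-wait so the kernel takes measurable time
}

int main(){
    auto t0 = std::chrono::high_resolution_clock::now();
    dummy_kernel<<<1, 1>>>();   //asynchronous launch: returns immediately
    auto t1 = std::chrono::high_resolution_clock::now();

    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    printf("elapsed: %f ms\n", ms); //reports only the launch overhead, not the kernel runtime

    cudaDeviceSynchronize();    //the kernel actually finishes here
    return 0;
}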
I've seen three basic ways of (successfully) timing CUDA kernels:
(1) Two CUDA eventRecords.
float responseTime; //result will be in milliseconds
cudaEvent_t start; cudaEventCreate(&start); cudaEventRecord(start); cudaEventSynchronize(start);
cudaEvent_t stop; cudaEventCreate(&stop);
kernel<<<g,b>>>();
cudaEventRecord(stop); cudaEventSynchronize(stop);
cudaEventElapsedTime(&responseTime, start, stop); //responseTime = elapsed time
(2) One CUDA eventRecord.
double start = read_timer(); //helper function on CPU, in milliseconds
cudaEvent_t stop; cudaEventCreate(&stop);
kernel<<<g,b>>>();
cudaEventRecord(stop); cudaEventSynchronize(stop);
double responseTime = read_timer() - start;
(3) deviceSynchronize instead of eventRecord. (Probably only useful when working in a single stream.)
double start = read_timer(); //helper function on CPU, in milliseconds
kernel<<<g,b>>>();
cudaDeviceSynchronize();
double responseTime = read_timer() - start;
I experimentally verified that these three strategies produce the same timing result.
Questions:
- What are the tradeoffs of these strategies? Any hidden details here?
- Aside from timing many kernels in multiple streams, are there any advantages to using two event records and the cudaEventElapsedTime() function?
You can probably use your imagination to figure out what read_timer() does. Nevertheless, it can't hurt to provide an example implementation:
#include <sys/time.h>

//returns the current wall-clock time in milliseconds
double read_timer(){
    struct timeval now;
    gettimeofday( &now, NULL );
    return ((double) now.tv_sec + 1.0e-6 * (double) now.tv_usec) * 1000.0;
}
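For completeness, a stand-alone test program along the lines of strategy (3) looks roughly like this (busy_kernel is just a placeholder for whatever kernel you want to time, and the extra warm-up launch is there so context initialization doesn't skew the measurement):

#include <cstdio>
#include <sys/time.h>

//read_timer() from above, repeated so this compiles stand-alone
double read_timer(){
    struct timeval now;
    gettimeofday( &now, NULL );
    return ((double) now.tv_sec + 1.0e-6 * (double) now.tv_usec) * 1000.0;
}

__global__ void busy_kernel(){
    for (volatile int i = 0; i < 10000000; ++i) {}
}

int main(){
    busy_kernel<<<1, 1>>>();     //warm-up launch
    cudaDeviceSynchronize();

    double start = read_timer();
    busy_kernel<<<1, 1>>>();
    cudaDeviceSynchronize();     //block until the kernel is done
    double responseTime = read_timer() - start;

    printf("kernel time: %f ms\n", responseTime);
    return 0;
}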