Strategies for timing CUDA Kernels: Pros and Cons?

Published 2019-01-27 07:40

Question:

When timing CUDA kernels, the following doesn't work because the kernel doesn't block the CPU program execution while it executes:

start timer
kernel<<<g,b>>>();
end timer

I've seen three basic ways of (successfully) timing CUDA kernels:

(1) Two CUDA eventRecords.

float responseTime; //result will be in milliseconds
cudaEvent_t start; cudaEventCreate(&start); cudaEventRecord(start); cudaEventSynchronize(start);
cudaEvent_t stop;  cudaEventCreate(&stop);
kernel<<<g,b>>>();
cudaEventRecord(stop); cudaEventSynchronize(stop);
cudaEventElapsedTime(&responseTime, start, stop); //responseTime = elapsed time

(2) One CUDA eventRecord.

float start = read_timer(); //helper function on CPU, in milliseconds
cudaEvent_t stop;  cudaEventCreate(&stop);
kernel<<<g,b>>>();
cudaEventRecord(stop); cudaEventSynchronize(stop);
float responseTime = read_timer() - start;

(3) deviceSynchronize instead of eventRecord. (Probably only useful when programming in a single stream.)

float start = read_timer(); //helper function on CPU, in milliseconds
kernel<<<g,b>>>();
cudaDeviceSynchronize();
float responseTime = read_timer() - start;

I experimentally verified that these three strategies produce the same timing result.


Questions:

  • What are the tradeoffs of these strategies? Any hidden details here?
  • Aside from timing many kernels in multiple streams, are there any advantages to using two event records and the cudaEventElapsedTime() function?

You can probably use your imagination to figure out what read_timer() does. Nevertheless, it can't hurt to provide an example implementation:

double read_timer(){ //returns the current time in milliseconds
    struct timeval now;
    gettimeofday( &now, NULL ); //requires #include <sys/time.h>
    return ((double)now.tv_sec + 1.0e-6 * now.tv_usec) * 1000.0;
}

Answer 1:

You seem to have ruled out most of the differences yourself: you say all three produce the same result for the relatively simple case you have shown (probably not exactly true, but I understand what you mean), and your "Aside from timing (complex sequences) ..." caveat already covers the case where the first approach is clearly better.

One possible difference would be portability between Windows and Linux. I believe your example read_timer function is Linux-oriented. You could probably craft a read_timer function that is "portable", but the CUDA event system (method 1) is portable as-is.
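For example, a portable read_timer could be sketched with std::chrono from standard C++ (the function name matches the question's helper; this particular implementation is only an illustration, not from the original post):

```cpp
#include <chrono>

// Portable millisecond timer: works on Windows, Linux, and macOS.
// Same contract as the question's read_timer(): returns a monotonically
// increasing timestamp in milliseconds, suitable for computing deltas.
double read_timer() {
    using namespace std::chrono;
    return duration<double, std::milli>(
        steady_clock::now().time_since_epoch()).count();
}
```

steady_clock is preferable to gettimeofday here because it is monotonic: it cannot jump backward if the system clock is adjusted mid-measurement.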



Answer 2:

Option (1) uses cudaEventRecord to time from the CPU side. This is highly inefficient, and I would discourage using cudaEventRecord for that purpose. cudaEventRecord can instead be used to time the GPU push-buffer execution of the kernel, as follows:

float responseTime; //result will be in milliseconds
cudaEvent_t start;
cudaEvent_t stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
kernel<<<g,b>>>();
cudaEventRecord(stop);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&responseTime, start, stop); //responseTime = elapsed time

The code needs to be changed slightly if you submit multiple items of work to multiple streams. I would recommend reading the answer to "Difference in time reported by NVVP and counters".
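In the multi-stream case, each pair of events is recorded on the stream whose work it brackets. A sketch of what that change could look like (the stream setup and the kernel names kernelA/kernelB are placeholders, not from the original answer):

```cuda
cudaStream_t s0, s1;
cudaStreamCreate(&s0);
cudaStreamCreate(&s1);

cudaEvent_t start0, stop0, start1, stop1;
cudaEventCreate(&start0); cudaEventCreate(&stop0);
cudaEventCreate(&start1); cudaEventCreate(&stop1);

// Record each event on the same stream as the work it brackets,
// so each elapsed time covers only that stream's kernel.
cudaEventRecord(start0, s0);
kernelA<<<g, b, 0, s0>>>();
cudaEventRecord(stop0, s0);

cudaEventRecord(start1, s1);
kernelB<<<g, b, 0, s1>>>();
cudaEventRecord(stop1, s1);

cudaEventSynchronize(stop0);
cudaEventSynchronize(stop1);

float msA, msB; //per-stream elapsed times, in milliseconds
cudaEventElapsedTime(&msA, start0, stop0);
cudaEventElapsedTime(&msB, start1, stop1);
```

A single cudaDeviceSynchronize() plus a CPU timer (option 3) could not separate the two kernels this way; it would only give the wall-clock time for both streams together.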

Options (2) and (3) are similar for the given example. Option (2) can be more flexible.
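One way to read "more flexible": the CPU timer in option (2) can bracket an arbitrary mixed sequence of host and device work, with a single event synchronization at the end. A sketch under that interpretation (the buffers d_in/h_in/d_out/h_out and bytes are hypothetical, not from the original answer):

```cuda
float start = read_timer(); //CPU-side timestamp, milliseconds

// Mixed sequence: copy in, compute, copy out — all asynchronous.
cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice);
kernel<<<g, b>>>();
cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost);

// One event after the last item; syncing on it waits for the whole sequence.
cudaEvent_t stop;
cudaEventCreate(&stop);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float responseTime = read_timer() - start; //covers copies + kernel + launch overhead
```

Note this measures end-to-end wall-clock time as seen by the CPU, including launch overhead, which may or may not be what you want compared to the device-only interval from cudaEventElapsedTime().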