Measuring execution time of OpenCL kernels

2020-02-05 20:38发布

I have the following loop that measures the time of my kernels:

double elapsed = 0;
cl_ulong time_start, time_end;
for (unsigned i = 0; i < NUMBER_OF_ITERATIONS; ++i)
{
    err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, &event); checkErr(err, "Kernel run");
    err = clWaitForEvents(1, &event); checkErr(err, "Kernel run wait fro event");
    err = clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, sizeof(time_start), &time_start, NULL); checkErr(err, "Kernel run get time start");
    err = clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, sizeof(time_end), &time_end, NULL); checkErr(err, "Kernel run get time end");
    elapsed += (time_end - time_start);
}

Then I divide elapsed by NUMBER_OF_ITERATIONS to get the final estimate. However, I am afraid the execution time of individual kernels is too small and hence can introduce uncertainty into my measurement. How can I measure the time taken by all NUMBER_OF_ITERATIONS kernels combined?

Can you suggest a profiling tool, which could help with this, as I do not need to access this data programmatically. I use NVIDIA's OpenCL.

4条回答
▲ chillily
2楼-- · 2020-02-05 20:42

The time that is measured is returned in nanoseconds, but you're right: The resolution of the timer is lower. However, I wonder what the actual execution time of your kernel is when you say that the time is too short to measure accurately (my gut feeling is that the resolution should be in the range of microseconds).

The most appropriate way of measuring the total time of multiple iterations depends on what "multiple" means here. Is NUMBER_OF_ITERATIONS=5 or NUMBER_OF_ITERATIONS=500000? If the number of iterations is "large", you may just use the system clock, possibly with OS-specific functions like QueryPerformanceCounter on windows (also see, for example, Is there a way to measure time up to micro seconds using C standard library? ), but of course, the precision of the system clock might be lower than that of the OpenCL device, so whether this makes sense really depends on the number of iterations.

It's a pity that NVIDIA removed OpenCL support from their Visual Profiler, though...

查看更多
男人必须洒脱
3楼-- · 2020-02-05 20:45

On Intel's OpenCL GPU implementation I've been successful with your approach (timing per kernel) and prefer it to batching a stream of NDRanges.

  • An alternate approach is to run it N times with and measure time with markers events as in the approach proposed in this question (the question not the answer).

  • Times for short kernels are generally at least in the microseconds realm in my experience.

  • You can check the timer resolution using clGetDeviceInfo with CL_DEVICE_PROFILING_TIMER_RESOLUTION (e.g. 80 ns on my setup).

查看更多
叼着烟拽天下
4楼-- · 2020-02-05 20:53

you need follow next steps to measure the execution time of OpenCL kernel execution time:

  1. Create a queue, profiling need been enable when the queue is created:

    cl_command_queue command_queue;
    command_queue = clCreateCommandQueue(context, devices[deviceUsed], CL_QUEUE_PROFILING_ENABLE, &err);
    
  2. Link an event when launch a kernel

    cl_event event;
    err=clEnqueueNDRangeKernel(queue, kernel, woridim, NULL, workgroupsize, NULL, 0, NULL, &event);
    
  3. Wait for the kernel to finish

    clWaitForEvents(1, &event);
    
  4. Wait for all enqueued tasks to finish

    clFinish(queue);
    
  5. Get profiling data and calculate the kernel execution time (returned by the OpenCL API in nanoseconds)

    cl_ulong time_start;
    cl_ulong time_end;
    
    clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, sizeof(time_start), &time_start, NULL);
    clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, sizeof(time_end), &time_end, NULL);
    
    double nanoSeconds = time_end-time_start;
    printf("OpenCl Execution time is: %0.3f milliseconds \n",nanoSeconds / 1000000.0);
    
查看更多
Lonely孤独者°
5楼-- · 2020-02-05 20:59

The profiling function returns nano seconds, and is very accurate (~50ns), however, the execution has different execution times, depending on others minor issues you can't control.

This reduces your problematic on what you want to measure:

  • Measuring the kernel execution time: Your approach is correct, the accuracy of the average execution time measured will increase as you increase N. This accounts only for the execution time, no overhead taken in consideration.

  • Measuring the kernel execution time + overhead: You should use events as well but measure since CL_PROFILING_COMMAND_SUBMIT, to account for extra execution overhead.

  • Measuring the real host side execution time: You should use events as well but measure since the first event start, to the last event end. Using CPU timing measurement is another possibility. If you want to measure this, then you should remove the waitforevents from the loop, to allow maximum throughput to the OpenCL system (and less overhead possible).

Answering the Tools question, I recomend to use nVIDIA visual profiler. BUt since is no longer available for OpenCL, you should use the Visual Studio Add on or an old version (CUDA 3.0) of the nvprofiler.

查看更多
登录 后发表回答