How do I know if the kernels are executing concurrently?

Posted 2019-01-23 21:16

Question:

I have a GPU with compute capability (CC) 3.0, so it should support 16 concurrent kernels. I am starting 10 kernels by calling clEnqueueNDRangeKernel 10 times in a loop. How can I tell whether the kernels are executing concurrently?

One approach I have thought of is to record the time before and after the clEnqueueNDRangeKernel call. I might have to use events to make sure the kernel has actually finished executing. But I still suspect that the loop will start the kernels sequentially. Can someone help me out?

Answer 1:

To determine if your kernel executions overlap, you have to profile them. This requires several steps:

1. Creating the command-queues

Profiling data is only collected if the command-queue is created with the property CL_QUEUE_PROFILING_ENABLE:

cl_int errcode;
cl_command_queue queues[10];
for (int i = 0; i < 10; ++i) {
  queues[i] = clCreateCommandQueue(context, device, CL_QUEUE_PROFILING_ENABLE,
                                   &errcode);
}

2. Making sure all kernels start at the same time

You are right in your assumption that the CPU enqueues the kernels sequentially. However, you can create a single user event and add it to the wait list of every kernel. This prevents any kernel from starting until the user event is completed:

// Create the user event
cl_event user_event = clCreateUserEvent(context, &errcode);

// Reserve space for kernel events
cl_event kernel_events[10];

// Enqueue kernels; each one waits on the user event
for (int i = 0; i < 10; ++i) {
  clEnqueueNDRangeKernel(queues[i], kernel, work_dim, global_work_offset,
                         global_work_size, local_work_size, 1, &user_event,
                         &kernel_events[i]);
}

// Start all kernels by completing the user event
clSetUserEventStatus(user_event, CL_COMPLETE);

3. Obtaining profiling times

Finally, we can collect the timing information for the kernel events:

// Block until all kernels have run to completion
clWaitForEvents(10, kernel_events);

for (int i = 0; i < 10; ++i) {
  cl_ulong start;
  clGetEventProfilingInfo(kernel_events[i], CL_PROFILING_COMMAND_START,
                          sizeof(start), &start, NULL);
  cl_ulong end;
  clGetEventProfilingInfo(kernel_events[i], CL_PROFILING_COMMAND_END,
                          sizeof(end), &end, NULL);
  printf("Event %d: start=%llu, end=%llu\n", i, (unsigned long long)start,
         (unsigned long long)end);
}

4. Analyzing the output

Now that you have the start and end times of all kernel runs, you can check for overlaps (either by hand or programmatically). The timestamps are in nanoseconds. Note, however, that the device timer is only accurate to a certain resolution, so differences smaller than that resolution are not meaningful. You can query the resolution using:

size_t resolution;
clGetDeviceInfo(device, CL_DEVICE_PROFILING_TIMER_RESOLUTION,
                sizeof(resolution), &resolution, NULL);
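
The programmatic overlap check mentioned above could look roughly like the following. This is only a sketch: the helper names intervals_overlap and count_overlapping_pairs are made up for illustration, and uint64_t stands in for cl_ulong so the snippet compiles without the OpenCL headers. It assumes you have copied the START and END values into two arrays and passes the timer resolution as a tolerance, so overlaps shorter than the resolution are ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if [s1, e1) and [s2, e2) overlap by more
 * than resolution_ns nanoseconds, 0 otherwise. */
int intervals_overlap(uint64_t s1, uint64_t e1, uint64_t s2, uint64_t e2,
                      uint64_t resolution_ns) {
  uint64_t latest_start = s1 > s2 ? s1 : s2;
  uint64_t earliest_end = e1 < e2 ? e1 : e2;
  return earliest_end > latest_start &&
         earliest_end - latest_start > resolution_ns;
}

/* Hypothetical helper: counts the pairs (i, j) with i < j whose execution
 * intervals overlap. start[i] and end[i] are the profiling timestamps of
 * kernel i (uint64_t is used here in place of cl_ulong). */
size_t count_overlapping_pairs(const uint64_t *start, const uint64_t *end,
                               size_t n, uint64_t resolution_ns) {
  size_t pairs = 0;
  for (size_t i = 0; i < n; ++i) {
    assert(start[i] <= end[i]); /* a kernel cannot end before it starts */
    for (size_t j = i + 1; j < n; ++j) {
      if (intervals_overlap(start[i], end[i], start[j], end[j],
                            resolution_ns)) {
        ++pairs;
      }
    }
  }
  return pairs;
}
```

A result of 0 means the kernels ran strictly one after another; with 10 kernels that all overlap each other, the result would be 45.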

FWIW, I tried this on an NVIDIA device with CC 2.0 (which should support concurrent kernels) and observed that the kernels ran sequentially.



Answer 2:

You can avoid the boilerplate code suggested in the other answers (which are correct, by the way) by using the C Framework for OpenCL. It simplifies this task considerably and gives you detailed information about OpenCL events (kernel executions, data transfers, etc.), including a table dedicated to overlapping execution of those events.

I developed this library in order to simplify the process described in the other answers.

Basic example:

ProfCLProfile* profile = profcl_profile_new();
cl_event events[NUMBER_OF_CL_EVENTS];
...

/* Start basic timing / profiling. */
profcl_profile_start(profile);

/* OpenCL data transfers, kernel executions, etc. */
...
clFinish(queue0);
clFinish(queue1);
...

/* Manage and show profiling info */
profcl_profile_stop(profile); 
profcl_profile_add(profile, "Transfer data to device", events[0], NULL);
profcl_profile_add(profile, "Kernel 1 execution", events[1], NULL);
profcl_profile_add(profile, "Kernel 2 execution", events[2], NULL);
...
profcl_profile_aggregate(profile, NULL);
profcl_print_info(profile, PROFCL_AGGEVDATA_SORT_TIME, NULL);

/* Two nice detailed tables will be printed: one for individual */
/* events (sorted by name or execution time), and another describing */
/* overlap of events. */


Answer 3:

Yes, as you suggest, try to use the events, and analyze all the QUEUED, SUBMIT, START, END values. These should be absolute values in "device time", and you may be able to see if processing (START to END) overlaps for the different kernels.
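
As a rough illustration of that analysis, the four timestamps of one event can be split into phases. The types event_times and event_phases and the function split_phases are hypothetical names, uint64_t stands in for cl_ulong, and the sketch assumes the driver reports the values in the order QUEUED <= SUBMIT <= START <= END.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the four values that clGetEventProfilingInfo
 * returns for one event, all in nanoseconds of device time. */
typedef struct {
  uint64_t queued; /* CL_PROFILING_COMMAND_QUEUED */
  uint64_t submit; /* CL_PROFILING_COMMAND_SUBMIT */
  uint64_t start;  /* CL_PROFILING_COMMAND_START */
  uint64_t end;    /* CL_PROFILING_COMMAND_END */
} event_times;

/* Hypothetical result type: time spent in each phase. */
typedef struct {
  uint64_t host_queue_ns;  /* enqueued on the host, not yet submitted */
  uint64_t device_wait_ns; /* submitted to the device, not yet running */
  uint64_t exec_ns;        /* actually executing */
} event_phases;

event_phases split_phases(event_times t) {
  /* Assumption: timestamps are monotonically ordered. */
  assert(t.queued <= t.submit && t.submit <= t.start && t.start <= t.end);
  event_phases p;
  p.host_queue_ns = t.submit - t.queued;
  p.device_wait_ns = t.start - t.submit;
  p.exec_ns = t.end - t.start;
  return p;
}
```

If device_wait_ns of each kernel is roughly the sum of exec_ns of the kernels submitted before it, that suggests the device is serializing them rather than running them side by side.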



Tags: opencl