C++ 17 introduced a number of new algorithms to support parallel execution, in particular std::reduce is a parallel version of std::accumulate which permits non-deterministic
behaviour for non-commutative
operations, such as floating point addition. I want to implement a reduce algorithm using OpenCL 2.
Intel have an example here which uses OpenCL 2 work group
kernel functions to implement a std::exclusive_scan OpenCL 2 kernel. Below is kernel to sum floats, based on Intel's exclusive_scan
example:
kernel void sum_float (global float* sum, global float* values)
{
float sum_val = 0.0f;
for (size_t i = 0u; i < get_num_groups(0); ++i)
{
size_t index = get_local_id(0) + i * get_enqueued_local_size(0);
float value = work_group_reduce_add(values[index]);
sum_val += work_group_broadcast(value, 0u);
}
sum[0] = sum_val;
}
The kernel above works (or seems to!). However, exclusive_scan
required the work_group_broadcast
function to pass the last value of one work group
to the next, whereas this kernel only requires the result of work_group_reduce_add to be added to sum_val
, so an atomic add
is more appropriate.
OpenCL 2 provides an atomic_int
which supports atomic_fetch_add
. An integer version of the kernel above using atomic_int is:
kernel void sum_int (global int* sum, global int* values)
{
atomic_int sum_val;
atomic_init(&sum_val, 0);
for (size_t i = 0u; i < get_num_groups(0); ++i)
{
size_t index = get_local_id(0) + i * get_enqueued_local_size(0);
int value = work_group_reduce_add(values[index]);
atomic_fetch_add(&sum_val, value);
}
sum[0] = atomic_load(&sum_val);
}
OpenCL 2 also provides an atomic_float
but it doesn't support atomic_fetch_add
.
What is the best way to implement an OpenCL2 kernel to sum floats?
this has a race condition to write data to sum's zero-indexed element, all workgroups are doing same computation which makes this O(N*N) instead of O(N) and takes more than 1100 milliseconds to complete a 1M-element array sum.
For same 1-M element array, this(global=1M, local=256)
followed by this (global=4k, local=256)
does the same thing in a few miliseconds except a third step. First one gets each group sums into their group-id related item and second kernel sums those into 16 values and these 16 values can easily summed by CPU(microseconds or less)(as third step).
Program works like this:
if you need to use atomics, you should do it as sparsely as possible. Easiest example can be using local atomics to sum many values by each group and then doing last step using a single global atomic function per group to add all. I don't have a C++ setup ready for OpenCL for now, but I guess OpenCL 2.0 atomics are better when you are using multiple devices with same memory resource(probably streaming mode or in SVM) and/or a CPU using C++17 functions. If you don't have multiple devices computing on same area at same time, then I suppose that these new atomics can only be a micro-optimization on top of already working OpenCL 1.2 atomics. I didn't use these new atomics so take all these as a grain of salt.