I have an array of vertices with this kind of structure:
[x0, y0, z0, empty float, x1, y1, z1, empty float, x2, y2, z2, empty float, ...]
I need to find minX
, minY
, minZ
, maxX
, maxY
and maxZ
using CUDA. I wrote a proper reduction algorithm, but it occurs to be a little too slow. I decided to use the THRUST library. There is a highly optimized reduce()
, or even better minmax_element()
, method which is a way to find max and min of an array simultaneously, but I can't find a fast way to use then only every 4
th index. Copying data to 3
separated arrays is not a solution I'm looking for.
Is there a way (some kind of tricks with Thrust iterators or something like this) to pass a stride to reduce()
?
I optimized my kernel following this document: http://www.cuvilib.com/Reduction.pdf
It looks like this for now:
Invoked like this:
Helpers:
In a few days I'll probably test which solution is faster.
You can use a strided range method, along with 3 calls to thrust::minmax_element, to get your desired result without modifying data storage.
Here's a worked example:
I've updated the above example to include for comparison an "optimized" reduction kernel that does all 6 reductions (min and max operations) in a single kernel call.
As expected, this approach runs faster than 3 individual thrust calls to produce the same result, about 5-8 times faster in my case (RHEL5.5, Quadro5000, CUDA 6.5RC), depending on data size. Note that although I have chosen a data size (
DSIZE
) here that is a power of 2, the entire example works correctly for arbitrary data sizes. I have dispensed with proper cuda error checking for the sake of brevity of presentation.EDIT: Following the suggestion by @JaredHoberock I have added a 3rd approach that allows a single call to
thrust::transform_reduce
to produce all 6 results. These are the "thrust2" results above. This method is about 3x faster than the first (three-thrust-call) method. Still not as fast as the cuda kernel method, but perhaps this thrust approach can be optimized further.This is an application of the strided range example.