Let's say I have two device_vector<byte> arrays, d_keys and d_data.
If d_data is, for example, a flattened 2D 3x5 array (e.g. { 1, 2, 3, 4, 5, 6, 7, 8, 9, 8, 7, 6, 5, 4, 3 }) and d_keys is a 1D array of size 5 (e.g. { 1, 0, 0, 1, 1 }), how can I do a reduction such that I'd end up only adding values on a per-row basis if the corresponding d_keys value is one (e.g. ending up with a result of { 10, 23, 14 })?
The sum_rows.cu example allows me to add every value in d_data, but that's not quite right.
Alternatively, I can, on a per-row basis, use a zip_iterator and combine d_keys with one row of d_data at a time, and do a transform_reduce, adding only if the key value is one, but then I'd have to loop through the d_data array.
What I really need is some sort of transform_reduce_by_key functionality that isn't built-in, but surely there must be a way to make it!
Based on the additional comment that there are thousands of rows rather than 3, we can write a transform functor that sums an entire row, with one functor invocation per row. With thousands of rows, this should keep the machine pretty busy.
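A minimal sketch of such a row-summing functor, assuming int data (rather than byte), row-major storage, and compile-time ROW and COL sizes. The names row_sum_functor and masked_row_sums are illustrative only:

```cuda
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/transform.h>
#include <thrust/iterator/counting_iterator.h>
#include <iostream>

#define ROW 3   // number of rows (assumed; thousands in the real use case)
#define COL 5   // number of columns, equal to the length of the key array

// Sums one row of the row-major array vals, skipping columns whose key is not 1.
struct row_sum_functor
{
  const int *vals;   // flattened ROW x COL data, row-major
  const int *keys;   // one key per column
  row_sum_functor(const int *v, const int *k) : vals(v), keys(k) {}

  __host__ __device__
  int operator()(int row) const
  {
    int sum = 0;
    for (int i = 0; i < COL; i++)
      if (keys[i] == 1) sum += vals[row * COL + i];
    return sum;
  }
};

// Illustrative helper: a single transform call over the row indices 0..ROW-1.
thrust::host_vector<int> masked_row_sums(const thrust::device_vector<int> &d_data,
                                         const thrust::device_vector<int> &d_keys)
{
  thrust::device_vector<int> d_sums(ROW);
  thrust::transform(thrust::counting_iterator<int>(0),
                    thrust::counting_iterator<int>(ROW),
                    d_sums.begin(),
                    row_sum_functor(thrust::raw_pointer_cast(d_data.data()),
                                    thrust::raw_pointer_cast(d_keys.data())));
  thrust::host_vector<int> h_sums = d_sums;   // copy the results back to the host
  return h_sums;
}

int main()
{
  int h_data[ROW * COL] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 8, 7, 6, 5, 4, 3};
  int h_keys[COL] = {1, 0, 0, 1, 1};
  thrust::device_vector<int> d_data(h_data, h_data + ROW * COL);
  thrust::device_vector<int> d_keys(h_keys, h_keys + COL);

  thrust::host_vector<int> sums = masked_row_sums(d_data, d_keys);
  for (int r = 0; r < ROW; r++) std::cout << sums[r] << " ";
  std::cout << std::endl;   // 10 23 14 for the sample data in the question
  return 0;
}
```

Each functor invocation walks one row sequentially, so the parallelism comes entirely from the number of rows.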
This approach has the drawback that in general accesses to the vals array will not be coalesced. However, for a few thousand rows the cache may offer significant relief. We can fix this problem by re-ordering the data to be stored in column-major form in the flattened array, and changing the indexing in the functor's loop to match, so that column i of a given row is read from vals[i*ROW + row]. If preferred, you can pass ROW as an additional parameter to the functor.
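A sketch of the column-major variant, again assuming int data. Here the row and column counts are passed to the functor as parameters instead of being macros, and the names col_major_row_sum and masked_row_sums_col_major are illustrative only:

```cuda
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/transform.h>
#include <thrust/iterator/counting_iterator.h>
#include <iostream>

// Sums one row of a column-major array: element (row, i) lives at vals[i * rows + row],
// so neighbouring threads (neighbouring rows) read neighbouring addresses.
struct col_major_row_sum
{
  const int *vals;
  const int *keys;
  int rows, cols;
  col_major_row_sum(const int *v, const int *k, int r, int c)
    : vals(v), keys(k), rows(r), cols(c) {}

  __host__ __device__
  int operator()(int row) const
  {
    int sum = 0;
    for (int i = 0; i < cols; i++)
      if (keys[i] == 1) sum += vals[i * rows + row];
    return sum;
  }
};

// Illustrative helper: d_data must already be stored in column-major order.
thrust::host_vector<int> masked_row_sums_col_major(const thrust::device_vector<int> &d_data,
                                                   const thrust::device_vector<int> &d_keys,
                                                   int rows)
{
  int cols = static_cast<int>(d_keys.size());
  thrust::device_vector<int> d_sums(rows);
  thrust::transform(thrust::counting_iterator<int>(0),
                    thrust::counting_iterator<int>(rows),
                    d_sums.begin(),
                    col_major_row_sum(thrust::raw_pointer_cast(d_data.data()),
                                      thrust::raw_pointer_cast(d_keys.data()),
                                      rows, cols));
  thrust::host_vector<int> h_sums = d_sums;
  return h_sums;
}

int main()
{
  // The 3x5 sample from the question, transposed into column-major order.
  int h_data[15] = {1, 6, 7, 2, 7, 6, 3, 8, 5, 4, 9, 4, 5, 8, 3};
  int h_keys[5] = {1, 0, 0, 1, 1};
  thrust::device_vector<int> d_data(h_data, h_data + 15);
  thrust::device_vector<int> d_keys(h_keys, h_keys + 5);

  thrust::host_vector<int> sums = masked_row_sums_col_major(d_data, d_keys, 3);
  for (int r = 0; r < 3; r++) std::cout << sums[r] << " ";
  std::cout << std::endl;   // 10 23 14, the same sums as the row-major layout
  return 0;
}
```

Only the index expression changes; the key test and the accumulation are the same as in the row-major version.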
Here is some sample code that does something like what you are after, using the approach I outlined in my comment below your question. In fact we want to use 4-tuples, to pick up your key value. Reproducing the suitably modified comment here:
You could make a zip iterator that zips your 3 rows together plus the key "row" and passes a 4-tuple to a special functor. Your special functor would then do a reduction on the array of 3-tuples (using the key also) and return a result that is a 4-tuple. The thrust dot product example may give you some ideas.
This is one possible approach:
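A minimal sketch of the zip-iterator idea, assuming int data held in three separate host_vector rows plus one key vector. It uses thrust::transform_reduce with two small functors (one masks by key, one adds tuples) rather than a single special functor; the names Tuple4, mask_by_key, add_tuples and masked_column_reduce are illustrative only:

```cuda
#include <thrust/host_vector.h>
#include <thrust/transform_reduce.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>
#include <iostream>

typedef thrust::tuple<int, int, int, int> Tuple4;   // (row0, row1, row2, key)

// Keeps the three row values of a column only when its key is 1, otherwise zeroes them.
struct mask_by_key
{
  template <typename T>
  __host__ __device__
  Tuple4 operator()(const T &t) const
  {
    bool keep = (thrust::get<3>(t) == 1);
    return Tuple4(keep ? thrust::get<0>(t) : 0,
                  keep ? thrust::get<1>(t) : 0,
                  keep ? thrust::get<2>(t) : 0,
                  keep ? 1 : 0);
  }
};

// Element-wise sum of two 4-tuples.
struct add_tuples
{
  __host__ __device__
  Tuple4 operator()(const Tuple4 &a, const Tuple4 &b) const
  {
    return Tuple4(thrust::get<0>(a) + thrust::get<0>(b),
                  thrust::get<1>(a) + thrust::get<1>(b),
                  thrust::get<2>(a) + thrust::get<2>(b),
                  thrust::get<3>(a) + thrust::get<3>(b));
  }
};

// Illustrative helper: reduces all columns at once; slots 0..2 hold the per-row sums,
// slot 3 holds the number of columns whose key was 1.
Tuple4 masked_column_reduce(const thrust::host_vector<int> &row0,
                            const thrust::host_vector<int> &row1,
                            const thrust::host_vector<int> &row2,
                            const thrust::host_vector<int> &keys)
{
  return thrust::transform_reduce(
      thrust::make_zip_iterator(thrust::make_tuple(row0.begin(), row1.begin(),
                                                   row2.begin(), keys.begin())),
      thrust::make_zip_iterator(thrust::make_tuple(row0.end(), row1.end(),
                                                   row2.end(), keys.end())),
      mask_by_key(), Tuple4(0, 0, 0, 0), add_tuples());
}

int main()
{
  int r0[] = {1, 2, 3, 4, 5}, r1[] = {6, 7, 8, 9, 8}, r2[] = {7, 6, 5, 4, 3};
  int k[] = {1, 0, 0, 1, 1};
  thrust::host_vector<int> row0(r0, r0 + 5), row1(r1, r1 + 5), row2(r2, r2 + 5);
  thrust::host_vector<int> keys(k, k + 5);

  Tuple4 result = masked_column_reduce(row0, row1, row2, keys);
  std::cout << thrust::get<0>(result) << " " << thrust::get<1>(result) << " "
            << thrust::get<2>(result) << std::endl;   // 10 23 14 for the sample data
  return 0;
}
```

Because the tuple width is fixed at compile time, this form only suits a small, known number of rows.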
Notes:
The code uses host_vector. Extending it to work with device_vector, or templatizing it to work with something other than int, should be straightforward.