Summary
Array [A - B - - - C] in device memory but want [A B C]: what's the quickest way with CUDA C?
Context
I have an array A of integers in device (GPU) memory. At each iteration, I randomly choose a few elements that are larger than 0 and subtract 1 from them. I maintain a sorted lookup array L holding the indices of those elements that are equal to 0:
Array A:
@ iteration i: [0 1 0 3 3 2 0 1 2 3]
@ iteration i + 1: [0 0 0 3 2 2 0 1 2 3]
Lookup for 0-elements L:
@ iteration i: [0 - 2 - - - 6 - - -] -> want compacted form: [0 2 6]
@ iteration i + 1: [0 1 2 - - - 6 - - -] -> want compacted form: [0 1 2 6]
(Here, I randomly chose elements 1 and 4 to subtract 1 from. In my implementation in CUDA C, each thread maps onto an element in A, and so the lookup array is sparse to prevent data races and to maintain a sorted ordering (e.g. [0 1 2 6] rather than [0 2 6 1]).)
Later, I will do some operation only for those elements that are equal to 0. Hence I need to compact my sparse lookup array L, so that I can map threads to 0-elements.
As such, what is the most efficient way to compact a sparse array on device memory with CUDA C?
Many thanks.
Suppose I have:
And my desired result is:
In effect we are removing elements that are zero, or copying elements only if non-zero.
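A minimal sketch of this with Thrust, assuming 0 marks an empty slot: `thrust::remove_copy_if` copies every element for which the predicate is false and keeps the relative order. The names `is_zero` and `compact_nonzero` are illustrative and not taken from the original code.

```cuda
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <thrust/remove.h>
#include <vector>

// Illustrative predicate: true for elements that mark an empty slot (0 here).
struct is_zero {
    __host__ __device__
    bool operator()(const int x) const { return x == 0; }
};

// Hypothetical helper: returns the non-zero elements of h_in, in order.
std::vector<int> compact_nonzero(const std::vector<int>& h_in) {
    // Copy the sparse input to the device.
    thrust::device_vector<int> d_in(h_in.begin(), h_in.end());
    // Worst case: nothing is removed, so allocate the full size.
    thrust::device_vector<int> d_out(h_in.size());

    // Stable compaction: elements for which is_zero is false are copied.
    thrust::device_vector<int>::iterator out_end =
        thrust::remove_copy_if(d_in.begin(), d_in.end(), d_out.begin(), is_zero());

    // Copy only the compacted prefix back to the host.
    std::vector<int> h_out(out_end - d_out.begin());
    thrust::copy(d_out.begin(), out_end, h_out.begin());
    return h_out;
}
```

Calling `compact_nonzero` on `{1, 0, 2, 0, 0, 0, 3}` should give `{1, 2, 3}`. The round trip through host memory is only there to keep the sketch self-contained; in a real pipeline the compacted data would normally stay in the `device_vector`.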
The struct definition provides us with a functor that tests for zero elements. Note that with Thrust we write no kernels and no device code directly; all of that happens behind the scenes. I'd definitely suggest familiarizing yourself with the quick start guide, so as not to turn this question into a tutorial on Thrust.
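For the lookup array described in the question, one possible variation (an assumption about the use case, not part of the original listing) is to skip the sparse array and build the compacted indices straight from A. The stencil overload of `thrust::copy_if` copies an index from a counting iterator whenever the matching element of A satisfies the predicate. `equals_zero` and `zero_indices` are hypothetical names.

```cuda
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <thrust/iterator/counting_iterator.h>
#include <vector>

// Illustrative predicate applied to the stencil (the values of A).
struct equals_zero {
    __host__ __device__
    bool operator()(const int x) const { return x == 0; }
};

// Hypothetical helper: returns the sorted indices i for which h_a[i] == 0.
std::vector<int> zero_indices(const std::vector<int>& h_a) {
    thrust::device_vector<int> d_a(h_a.begin(), h_a.end());
    thrust::device_vector<int> d_idx(h_a.size());

    // Implicit sequence 0, 1, 2, ... with no storage behind it.
    thrust::counting_iterator<int> first(0);
    thrust::counting_iterator<int> last = first + static_cast<int>(h_a.size());

    // Index i is copied when equals_zero(d_a[i]) is true; order is preserved.
    thrust::device_vector<int>::iterator idx_end =
        thrust::copy_if(first, last, d_a.begin(), d_idx.begin(), equals_zero());

    std::vector<int> h_idx(idx_end - d_idx.begin());
    thrust::copy(d_idx.begin(), idx_end, h_idx.begin());
    return h_idx;
}
```

With A from iteration i + 1 in the question, `[0 0 0 3 2 2 0 1 2 3]`, this should yield `[0 1 2 6]`. Because the copy is stable, the result comes out sorted without a separate sort step.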
After reviewing the comments, I think this modified version of the code will work around the CUDA 4.0 issues: