I have an array of unsigned integers stored on the GPU with CUDA (typically 1000000
elements). I would like to count the occurence of every number in the array. There are only a few distinct numbers (about 10
), but these numbers can span from 1 to 1000000
. About 9/10
th of the numbers are 0
, I don't need the count of them. The result looks something like this:
58458 -> 1000 occurences
15 -> 412 occurences
I have an implementation using atomicAdd
s, but it is too slow (a lot of threads write to the same address). Does someone know of a fast/efficient method?
You can implement a histogram by first sorting the numbers, and then doing a keyed reduction.
The most straightforward method would be to use
thrust::sort
and thenthrust::reduce_by_key
. It's also often much faster than ad hoc binning based on atomics. Here's an example.I suppose you can find help in the CUDA examples, specifically the histogram examples. They are part of the GPU computing SDK. You can find it here http://developer.nvidia.com/cuda-cc-sdk-code-samples#histogram. They even have a whitepaper explaining the algorithms.
I'm comparing two approaches suggested at the duplicate question thrust count occurence, namely,
thrust::counting_iterator
andthrust::upper_bound
, following the histogram Thrust example;thrust::unique_copy
andthrust::upper_bound
.Below, please find a fully worked example.
The first approach has shown to be the fastest. On an NVIDIA GTX 960 card, I have had the following timings for a number of
N = 1048576
array elements:Please, note that there is no strict need to calculate the adjacent difference explicitly, since this operation can be manually done during a kernel processing, if needed.