I want to optimize histogram code with NEON intrinsics, but I haven't succeeded. Here is the C code:
#define NUM (7*1024*1024)
uint8 src_data[NUM];
uint32 histogram_result[256] = {0};
for (int i = 0; i < NUM; i++)
{
    histogram_result[src_data[i]]++;
}
Computing a histogram is essentially serial processing, so it's difficult to optimize with NEON intrinsics. Does anyone know how to optimize it? Thanks in advance.
You can't vectorise the stores directly, but you can pipeline them, and you can vectorise the address calculation on 32-bit platforms (and to a lesser extent on 64-bit platforms).
The first thing you'll want to do, which doesn't actually require NEON to benefit, is to unroll the histogram array so that you can have more data in flight at once:
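As a rough sketch of that unrolling, assuming four interleaved sub-histograms laid out as [256][4]: the names histogram_x4 and partial are made up for illustration, and uint8_t/uint32_t from <stdint.h> stand in for the uint8/uint32 typedefs in the question.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch: one histogram column per unrolled slot. */
void histogram_x4(const uint8_t *src, size_t n, uint32_t result[256])
{
    uint32_t partial[256][4] = {{0}};
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        /* Each pointer targets a different column, so none of them alias. */
        uint32_t *p0 = &partial[src[i + 0]][0];
        uint32_t *p1 = &partial[src[i + 1]][1];
        uint32_t *p2 = &partial[src[i + 2]][2];
        uint32_t *p3 = &partial[src[i + 3]][3];

        /* All four loads first, then all four stores. */
        uint32_t c0 = *p0, c1 = *p1, c2 = *p2, c3 = *p3;
        *p0 = c0 + 1;
        *p1 = c1 + 1;
        *p2 = c2 + 1;
        *p3 = c3 + 1;
    }

    /* Leftover bytes when n is not a multiple of 4. */
    for (; i < n; i++)
        partial[src[i]][0]++;

    /* Finalisation: fold the four columns into one count per value. */
    for (int v = 0; v < 256; v++)
        result[v] = partial[v][0] + partial[v][1]
                  + partial[v][2] + partial[v][3];
}
```

For the question's data this would be called as histogram_x4(src_data, NUM, histogram_result).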
Note that p0 to p3 can never point to the same address, so reordering their reads and writes is just fine.
From that you can vectorise the calculation of p0 to p3 with intrinsics, and you can vectorise the finalisation loop.
Test it as-is first (because I didn't!). Then you can experiment with structuring the array as result[4][256] instead of result[256][4], or using a smaller or larger unroll factor.
Applying some NEON intrinsics to this:
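One possible shape for this, as a sketch only and not benchmarked: NEON widens eight source bytes and computes value*4 + column for each, the increments themselves stay scalar, and vld4q_u32 de-interleaves the columns in the finalisation loop. The names histogram_x4_neon, bump4, and HIST_HAVE_NEON are hypothetical, and the scalar fallback is there only so the same file builds on non-ARM targets.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HIST_HAVE_NEON 1
#else
#define HIST_HAVE_NEON 0
#endif

#if HIST_HAVE_NEON
/* Increment four counters whose flat indices sit in the lanes of idx.
   The lanes address different columns, so the pointers cannot alias and
   the loads can all be issued before the stores. */
static inline void bump4(uint32_t *partial, uint32x4_t idx)
{
    uint32_t *p0 = partial + vgetq_lane_u32(idx, 0);
    uint32_t *p1 = partial + vgetq_lane_u32(idx, 1);
    uint32_t *p2 = partial + vgetq_lane_u32(idx, 2);
    uint32_t *p3 = partial + vgetq_lane_u32(idx, 3);
    uint32_t c0 = *p0, c1 = *p1, c2 = *p2, c3 = *p3;
    *p0 = c0 + 1;
    *p1 = c1 + 1;
    *p2 = c2 + 1;
    *p3 = c3 + 1;
}
#endif

void histogram_x4_neon(const uint8_t *src, size_t n, uint32_t result[256])
{
    uint32_t partial[256 * 4] = {0};   /* flat view of [256][4] */
    size_t i = 0;

#if HIST_HAVE_NEON
    static const uint32_t lane_ids[4] = {0, 1, 2, 3};
    const uint32x4_t lane = vld1q_u32(lane_ids);

    for (; i + 8 <= n; i += 8) {
        /* Widen 8 bytes to two vectors of four 32-bit values. */
        uint16x8_t w  = vmovl_u8(vld1_u8(src + i));
        uint32x4_t lo = vmovl_u16(vget_low_u16(w));
        uint32x4_t hi = vmovl_u16(vget_high_u16(w));
        /* Flat index = value * 4 + column. */
        bump4(partial, vaddq_u32(vshlq_n_u32(lo, 2), lane));
        bump4(partial, vaddq_u32(vshlq_n_u32(hi, 2), lane));
    }
#endif

    /* Scalar tail (and the whole input when NEON is unavailable). */
    for (; i < n; i++)
        partial[4 * (size_t)src[i] + (i & 3)]++;

#if HIST_HAVE_NEON
    /* vld4q_u32 de-interleaves four rows: val[k] holds column k of
       values v..v+3, so summing the four vectors gives four totals. */
    for (int v = 0; v < 256; v += 4) {
        uint32x4x4_t q = vld4q_u32(partial + 4 * v);
        vst1q_u32(result + v,
                  vaddq_u32(vaddq_u32(q.val[0], q.val[1]),
                            vaddq_u32(q.val[2], q.val[3])));
    }
#else
    for (int v = 0; v < 256; v++)
        result[v] = partial[4 * v] + partial[4 * v + 1]
                  + partial[4 * v + 2] + partial[4 * v + 3];
#endif
}
```

Whether moving the indices from NEON back to general-purpose registers pays off depends heavily on the core, so it is worth measuring against the plain unrolled version.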
With the histogram array unrolled x8 rather than x4 you might want to use eight scalar accumulators instead of four, but keep in mind that this implies eight count registers and eight address registers, sixteen in total, which is more registers than 32-bit ARM has (since you can't use SP and PC).
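For comparison, a scalar x8 variant might look like the sketch below. The name histogram_x8 is hypothetical, and it assumes the compiler fully unrolls the small fixed-count loops so that p[] and c[] end up in registers where enough are available, which is exactly where 32-bit ARM is likely to spill.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch: eight interleaved sub-histograms. */
void histogram_x8(const uint8_t *src, size_t n, uint32_t result[256])
{
    uint32_t partial[256][8] = {{0}};
    size_t i = 0;

    for (; i + 8 <= n; i += 8) {
        uint32_t *p[8];   /* eight addresses ... */
        uint32_t c[8];    /* ... and eight counts live at the same time */
        for (int k = 0; k < 8; k++) p[k] = &partial[src[i + k]][k];
        for (int k = 0; k < 8; k++) c[k] = *p[k];
        for (int k = 0; k < 8; k++) *p[k] = c[k] + 1;
    }

    /* Leftover bytes when n is not a multiple of 8. */
    for (; i < n; i++)
        partial[src[i]][0]++;

    /* Finalisation: fold the eight columns into one count per value. */
    for (int v = 0; v < 256; v++) {
        uint32_t sum = 0;
        for (int k = 0; k < 8; k++)
            sum += partial[v][k];
        result[v] = sum;
    }
}
```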
Unfortunately, with address calculation in the hands of NEON intrinsics, I think the compiler can't safely reason about how it might re-order reads and writes, so you have to reorder them explicitly and hope that you're doing it the best possible way.