I am trying to implement histogram in Neon. Is it possible to vectorise ?
问题:
回答1:
Histogramming is almost impossible to vectorize, unfortunately.
You can probably optimise the scalar code somewhat however - a common trick is to use two histograms and then combine them at the end. This allows you to overlap loads/increments/stores and thereby bury some of the serial dependencies and associated latencies. Pseudo code:
init histogram 1 to all 0s
init histogram 2 to all 0s
loop
get input value 1
get input value 2
load count for value 1 from histogram 1
load count for value 2 from histogram 2
increment count for histogram 1
increment count for histogram 2
store count for value 1 to histogram 1
store count for value 2 to histogram 2
until done
combine histogram 1 and histogram 2
回答2:
ermig1979 has a Simd project which shows how he has done histograms using a similar approach to what @Paul-R has mentioned but also with SSE2 and AVX2 variants:
Project: https://github.com/ermig1979/Simd
Base file: https://github.com/ermig1979/Simd/blob/master/src/Simd/SimdBaseHistogram.cpp
An AVX2 implementation can be seen here: https://github.com/ermig1979/Simd/blob/master/src/Simd/SimdAvx2Histogram.cpp
A scalar solution can be seen below to illustrate the basic principle of creating multiple histograms that are summed at the end:
void Histogram(const uint8_t * src, size_t width, size_t height, size_t stride,
uint32_t * histogram)
{
uint32_t histograms[4][HISTOGRAM_SIZE];
memset(histograms, 0, sizeof(uint32_t)*HISTOGRAM_SIZE*4);
size_t alignedWidth = Simd::AlignLo(width, 4);
for(size_t row = 0; row < height; ++row)
{
size_t col = 0;
for(; col < alignedWidth; col += 4)
{
++histograms[0][src[col + 0]];
++histograms[1][src[col + 1]];
++histograms[2][src[col + 2]];
++histograms[3][src[col + 3]];
}
for(; col < width; ++col)
++histograms[0][src[col + 0]];
src += stride;
}
for(size_t i = 0; i < HISTOGRAM_SIZE; ++i)
histogram[i] = histograms[0][i] + histograms[1][i] +
histograms[2][i] + histograms[3][i];
}
回答3:
@Paul-R, there exists some paper at this link which discusses how to vectorize histogram functions:
SIMD Vectorization of Histogram Functions
回答4:
Some image processing algorithm working on histograms (e.g. equalization, histogram matching) can be made work with known percentiles -- and for an approximation one can effectively parallelize the search to initial ranges (0,25,50,75,100%) consuming 4 accumulators.
Each item in the input stream must be compared in parallel to all slots, incrementing the frequency. After a certain number of rounds (e.g. n*255 rounds guaranteeing no overflows on uint8_t data type, then accumulating those to uint16_t) the min/max limits in each slot are recalculated based on linear interpolation. And it's of course possible to re-run the sequence based on an estimation how much the new data has changed the estimates of the percentiles.
The algorithm would be variant to evaluation order, which could be mitigated by random sampling and multiple passes.