Methods to vectorise histogram in SIMD?

2019-01-14 16:37发布

I am trying to implement histogram in Neon. Is it possible to vectorise ?

4条回答
不美不萌又怎样
2楼-- · 2019-01-14 17:18

Histogramming is almost impossible to vectorize, unfortunately.

You can probably optimise the scalar code somewhat however - a common trick is to use two histograms and then combine them at the end. This allows you to overlap loads/increments/stores and thereby bury some of the serial dependencies and associated latencies. Pseudo code:

init histogram 1 to all 0s
init histogram 2 to all 0s
loop
  get input value 1
  get input value 2
  load count for value 1 from histogram 1
  load count for value 2 from histogram 2
  increment count for histogram 1
  increment count for histogram 2
  store count for value 1 to histogram 1
  store count for value 2 to histogram 2
until done
combine histogram 1 and histogram 2
查看更多
小情绪 Triste *
3楼-- · 2019-01-14 17:20

ermig1979 has a Simd project which shows how he has done histograms using a similar approach to what @Paul-R has mentioned but also with SSE2 and AVX2 variants:

Project: https://github.com/ermig1979/Simd

Base file: https://github.com/ermig1979/Simd/blob/master/src/Simd/SimdBaseHistogram.cpp

An AVX2 implementation can be seen here: https://github.com/ermig1979/Simd/blob/master/src/Simd/SimdAvx2Histogram.cpp

A scalar solution can be seen below to illustrate the basic principle of creating multiple histograms that are summed at the end:

void Histogram(const uint8_t * src, size_t width, size_t height, size_t stride, 
    uint32_t * histogram)
{
    uint32_t histograms[4][HISTOGRAM_SIZE];
    memset(histograms, 0, sizeof(uint32_t)*HISTOGRAM_SIZE*4);
    size_t alignedWidth = Simd::AlignLo(width, 4);
    for(size_t row = 0; row < height; ++row)
    {
        size_t col = 0;
        for(; col < alignedWidth; col += 4)
        {
            ++histograms[0][src[col + 0]];
            ++histograms[1][src[col + 1]];
            ++histograms[2][src[col + 2]];
            ++histograms[3][src[col + 3]];
        }
        for(; col < width; ++col)
            ++histograms[0][src[col + 0]];

        src += stride;
    }

    for(size_t i = 0; i < HISTOGRAM_SIZE; ++i)
        histogram[i] = histograms[0][i] + histograms[1][i] + 
            histograms[2][i] + histograms[3][i];
}
查看更多
祖国的老花朵
4楼-- · 2019-01-14 17:20

Some image processing algorithm working on histograms (e.g. equalization, histogram matching) can be made work with known percentiles -- and for an approximation one can effectively parallelize the search to initial ranges (0,25,50,75,100%) consuming 4 accumulators.

Each item in the input stream must be compared in parallel to all slots, incrementing the frequency. After a certain number of rounds (e.g. n*255 rounds guaranteeing no overflows on uint8_t data type, then accumulating those to uint16_t) the min/max limits in each slot are recalculated based on linear interpolation. And it's of course possible to re-run the sequence based on an estimation how much the new data has changed the estimates of the percentiles.

The algorithm would be variant to evaluation order, which could be mitigated by random sampling and multiple passes.

查看更多
三岁会撩人
5楼-- · 2019-01-14 17:40

@Paul-R, there exists some paper at this link which discusses how to vectorize histogram functions:

SIMD Vectorization of Histogram Functions

查看更多
登录 后发表回答