Does anyone have any thoughts on how to calculate the mode (statistic) of a vector of 8-bit integers in SSE4.x? To clarify, this would be 16x8-bit values in a 128-bit register.
I want the result as a vector mask which selects the mode-valued elements. i.e. the result of _mm_cmpeq_epi8(v, set1(mode(v)))
, as well as the scalar value.
Providing some additional context; while the above problem is an interesting one to solve in its own right, I have been through most algorithms I can think of with linear complexity. This class will wipe out any gains I can get from calculating this number.
I hope to engage you all in searching for some deep magic, here. It's possible that an approximation may be necessary to break this bound, such as "select a frequently occurring element" for example (N.B. difference against the most), which would be of merit. A probabilistic answer would be usable, too.
SSE and x86 have some very interesting semantics. It may be worth exploring a superoptimization pass.
Sort the data in the register. Insertion sort can be done in 16 (15) steps, by initializing the register to "Infinity", which tries to illustrate a monotonically decreasing array and inserting the new element in parallel to all possible places:
Then calculate mask for
vec[n+1] == vec[n]
, move mask to GPR and use that to index a 32768 entry LUT for best index location.In real case one probably want's to sort more than just one vector; i.e. sort 16 16-entry vectors at once;
The inner loop totals amortized cost of 7-8 instructions per result; Sorting has typically 2 instructions per stage -- i.e. 8 instructions per result, when 16 results need 60 stages or 120 instructions. (This still leaves the transpose as an exercise -- but I think it should be vastly faster than sorting?)
So, this should be in the ball park of 24 instructions per 8-bit result.
Probably a relatively simple brute force SSEx approach is suitable here, see the code below. The idea is to byte-rotate the input vector
v
by 1 to 15 positions and compare the rotated vector with the originalv
for equality. To shorten the dependency chain and to increase the instruction level parallelism, two counters are used to count (vertical sum) these equal elements:sum1
andsum2
, because there might be architectures that benefit from this. Equal elements are counted as -1. Variablesum = sum1 + sum2
contains the total count with values between -1 and -16.min_brc
contains the horizontal minimum ofsum
broadcasted to all elements.mask = _mm_cmpeq_epi8(sum,min_brc)
is the mask for the mode-valued elements requested as an intermediate result by the OP. In the next few lines of the code the actual mode is extracted.This solution is certainly faster than a scalar solution. Note that with AVX2 the upper 128-bit lanes can be used to speedup the computation further.
It takes 20 cycles (throughput) to compute only the a mask for the mode-valued elements. With the actual mode broadcasted across the SSE register it takes about 21.4 cycles.
Note the behaviour in the next example:
[1, 1, 3, 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
returnsmask=[-1,-1,-1,-1,0,0,...,0]
and the mode value is 1, although 1 occurs as often as 3.The code below is tested, but not thoroughly tested
Output:
The horizontal minimum is computed with Evgeny Kluev's method.
For performance comparison with scalar code. Non-vectorized on main part but vectorized on table-clear and tmp initialization. (168 cycles per f() call for fx8150 (22M calls complete in 1.0002 seconds at 3.7 GHz))
g++ 6.3 output: