I wrote the function int compare_16bytes(__m128i lhs, __m128i rhs)
in order to compare two 16 byte numbers using SSE instructions: this function returns how many bytes are equal after performing the comparison.
Now I would like use the above function in order to compare two byte arrays of arbitrary length: the length may not be a multiple of 16 bytes, so I need deal with this problem. How could I complete the implementation of the function below? How could I improve the function below?
int fast_compare(const char* s, const char* t, int length)
{
int result = 0;
const char* sPtr = s;
const char* tPtr = t;
while(...)
{
const __m128i* lhs = (const __m128i*)sPtr;
const __m128i* rhs = (const __m128i*)tPtr;
// compare the next 16 bytes of s and t
result += compare_16bytes(*lhs,*rhs);
sPtr += 16;
tPtr += 16;
}
return result;
}
As @Mysticial says in the comments above, do the compare and sum vertically and then just sum horizontally at the end of the main loop:
Compile and run the above test harness:
Note that there is one possibly non-obvious trick in the above SSE code where we use
_mm_madd_epi16
to unpack and accumulate 16 bit0
/-1
values to 32 bit partial sums. We take advantage of the fact that-1*-1 = 1
(and0*0 = 0
of course) - we're not really doing a multiply here, just unpacking and summing in one instruction.UPDATE: as noted in the comments below, this solution is not optimal - I just took a fairly optimal 16 bit solution and added 8 bit to 16 bit unpacking to make it work for 8 bit data. However for 8 bit data there are more efficient methods, e.g. using
psadbw
/_mm_sad_epu8
. I'll leave this answer here for posterity, and for anyone who might want to do this kind of thing with 16 bit data, but really one of the other answers which doesn't require unpacking the input data should be the accepted answer.Using partial sums in 16 x uint8 elements may give even better performance.
I have divided the loop into inner loop and outer loop.
The inner loop sum uint8 elements (each uint8 element can sum up to 255 "1"s).
Small trick: _mm_cmpeq_epi8 set equal elements to 0xFF, and (char)0xFF = -1, so you can subtract the result from the sum (subtract -1 for adding 1).
Here is my optimized version for fast_compare:
The fastest way for large inputs is Rotem's answer, where the inner loop is
pcmpeqb
/psubb
, breaking out to horizontally sum before any byte element of the vector accumulator overflows. Do the hsum of unsigned bytes withpsadbw
against an all-zero vector.Without unrolling / nested loops, the best option is probably
If you don't have a lot of register pressure in your loop,
psadbw
against a vector of0x7f
instead of all-zero.psadbw(0x00, set1(0x7f))
=>sum += 0x7f
psadbw(0xff, set1(0x7f))
=>sum += 0x80
So instead of dividing by 255 (which the compiler should do efficiently without an actual
div
), you just have to subtractn * 0x7f
, wheren
is the number of elements.Also note that
paddq
is slow on pre-Nehalem, and Atom, so you could usepaffffd
(_mm_add_epi32
) if you don't expect 128 * the count to ever overflow a 32bit integer.This compares very well with the Paul R's
pcmpeqb
/ 2xpunpck
/ 2xpmaddwd
/ 2xpaddw
.The integer comparison in SSE produces bytes that either all zeros or all ones. If you want to count, you first need to right shift (not arithmetic) the comparison result by 7, then add to the result vector. At the end, you still need to reduce the result vector by summing its elements. This reduction has to be done in scalar code, or with a sequence of add/shifts. Usually this part is not worth troubling with.