I'm interested in identifying overflowing values when adding unsigned 8-bit integers, and saturating the result to 0xFF:
__m128i m1 = _mm_loadu_si128(/* 16 8-bit unsigned integers */);
__m128i m2 = _mm_loadu_si128(/* 16 8-bit unsigned integers */);
__m128i m3 = _mm_adds_epu8(m1, m2);
I would like to perform a less-than comparison on these unsigned integers, similar to what _mm_cmplt_epi8 does for signed ones:
__m128i mask = _mm_cmplt_epi8 (m3, m1);
m1 = _mm_or_si128(m3, mask);
If an "epu8" equivalent were available, mask would hold 0xFF where m3[i] < m1[i] (overflow!) and 0x00 otherwise, and we could saturate m1 using the "or", so that m1 holds the addition result where it is valid and 0xFF where it overflowed.
The problem is that _mm_cmplt_epi8 performs a signed comparison, so for instance if m1[i] = 0x70 and m2[i] = 0x10, then m3[i] = 0x80 and mask[i] = 0xFF (as signed bytes, 0x80 is -128, which is less than 0x70 = 112), which is obviously not what I require.
Using VS2012.
I would appreciate another approach for performing this. Thanks!
One way of implementing compares for unsigned 8-bit vectors is to exploit _mm_max_epu8, which returns the maximum of unsigned 8-bit int elements. You can take the (unsigned) maximum of two elements, compare it for equality with one of the source elements, and then return the appropriate result. This translates to 2 instructions for >= or <=, and 3 instructions for > or <.
Example code:
#include <stdio.h>
#include <emmintrin.h> // SSE2
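// a >= b (unsigned): true exactly when max(a, b) == a
#define _mm_cmpge_epu8(a, b) \
_mm_cmpeq_epi8(_mm_max_epu8(a, b), a)
// a <= b: swap the operands of >=
#define _mm_cmple_epu8(a, b) _mm_cmpge_epu8(b, a)
// a > b: NOT (a <= b), inverted by xor with all ones
#define _mm_cmpgt_epu8(a, b) \
_mm_xor_si128(_mm_cmple_epu8(a, b), _mm_set1_epi8(-1))
// a < b: swap the operands of >
#define _mm_cmplt_epu8(a, b) _mm_cmpgt_epu8(b, a)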
int main(void)
{
__m128i va = _mm_setr_epi8(0, 0, 1, 1, 1, 127, 127, 127, 128, 128, 128, 254, 254, 254, 255, 255);
__m128i vb = _mm_setr_epi8(0, 255, 0, 1, 255, 0, 127, 255, 0, 128, 255, 0, 254, 255, 0, 255);
__m128i v_ge = _mm_cmpge_epu8(va, vb);
__m128i v_le = _mm_cmple_epu8(va, vb);
__m128i v_gt = _mm_cmpgt_epu8(va, vb);
__m128i v_lt = _mm_cmplt_epu8(va, vb);
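// Note: the "v" length modifier is a non-standard vector-printing extension (supported by Apple's libc, for example)
printf("va = %4vhhu\n", va);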
printf("vb = %4vhhu\n", vb);
printf("v_ge = %4vhhu\n", v_ge);
printf("v_le = %4vhhu\n", v_le);
printf("v_gt = %4vhhu\n", v_gt);
printf("v_lt = %4vhhu\n", v_lt);
return 0;
}
Compile and run:
$ gcc -Wall _mm_cmplt_epu8.c && ./a.out
va = 0 0 1 1 1 127 127 127 128 128 128 254 254 254 255 255
vb = 0 255 0 1 255 0 127 255 0 128 255 0 254 255 0 255
v_ge = 255 0 255 255 0 255 255 0 255 255 0 255 255 0 255 255
v_le = 255 255 0 255 255 0 255 255 0 255 255 0 255 255 0 255
v_gt = 0 0 255 0 0 255 0 0 255 0 0 255 0 0 255 0
v_lt = 0 255 0 0 255 0 0 255 0 0 255 0 0 255 0 0
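To tie this back to the question, here is a minimal sketch of how the unsigned less-than could be used to detect wraparound and force those lanes to 0xFF. The names cmplt_epu8_sketch and add_saturate_16 are made up for illustration, and the array-based interface is just an assumption to keep the example self-contained:

```c
#include <assert.h>
#include <stdint.h>
#include <emmintrin.h> /* SSE2 */

/* Unsigned a < b per byte: NOT (max(a, b) == a), i.e. NOT (a >= b). */
static inline __m128i cmplt_epu8_sketch(__m128i a, __m128i b)
{
    __m128i ge = _mm_cmpeq_epi8(_mm_max_epu8(a, b), a);
    return _mm_xor_si128(ge, _mm_set1_epi8(-1));
}

/* Hypothetical helper: adds 16 unsigned bytes with wraparound, then
   forces every lane whose sum wrapped to 0xFF. */
static inline void add_saturate_16(const uint8_t *x, const uint8_t *y, uint8_t *out)
{
    assert(x && y && out);
    __m128i m1 = _mm_loadu_si128((const __m128i *)x);
    __m128i m2 = _mm_loadu_si128((const __m128i *)y);
    __m128i m3 = _mm_add_epi8(m1, m2);        /* wrapping add */
    __m128i mask = cmplt_epu8_sketch(m3, m1); /* 0xFF where the sum wrapped */
    _mm_storeu_si128((__m128i *)out, _mm_or_si128(m3, mask));
}
```

With the values from the question (0x70 + 0x10), the sum 0x80 is not smaller than 0x70 as an unsigned byte, so the mask stays 0x00 and the result is left at 0x80.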
The other answers got me thinking of a simpler method to answer the specific question more directly:
To simply detect clamping, do saturating and non-saturating additions, and compare the results.
__m128i m1 = _mm_loadu_si128(/* 16 8-bit unsigned integers */);
__m128i m2 = _mm_loadu_si128(/* 16 8-bit unsigned integers */);
__m128i m1m2_sat = _mm_adds_epu8(m1, m2);
__m128i m1m2_wrap = _mm_add_epi8(m1, m2);
__m128i non_clipped = _mm_cmpeq_epi8(m1m2_sat, m1m2_wrap);
So that's just two instructions beyond the adds, and one of them can run in parallel with the adds. So the non_clipped mask is ready one cycle after the addition result. (Potentially 3 instructions (an extra movdqa) without AVX 3-operand non-destructive vector ops.)
If the non-saturating add result is 0xFF, it will match the saturating-add result, and be detected as not clipping. This is why it's different from just checking the output of the saturating add for 0xFF bytes.
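For anyone who wants to try it, here is the same idea as a small sketch wrapped in a function. The name adds_with_mask_16 and the array-based interface are my own assumptions for the example, not part of any library:

```c
#include <assert.h>
#include <stdint.h>
#include <emmintrin.h> /* SSE2 */

/* Hypothetical helper: writes the 16 saturated sums to sum, and a mask to
   non_clipped that is 0xFF where the add did NOT saturate, 0x00 where it did. */
static inline void adds_with_mask_16(const uint8_t *x, const uint8_t *y,
                                     uint8_t *sum, uint8_t *non_clipped)
{
    assert(x && y && sum && non_clipped);
    __m128i m1 = _mm_loadu_si128((const __m128i *)x);
    __m128i m2 = _mm_loadu_si128((const __m128i *)y);
    __m128i sat = _mm_adds_epu8(m1, m2);  /* saturating add */
    __m128i wrap = _mm_add_epi8(m1, m2);  /* wrapping add */
    /* The two results differ exactly in the lanes that were clamped. */
    __m128i same = _mm_cmpeq_epi8(sat, wrap);
    _mm_storeu_si128((__m128i *)sum, sat);
    _mm_storeu_si128((__m128i *)non_clipped, same);
}
```

For example, 0xF0 + 0x0F gives 0xFF in both adds, so that lane is reported as not clipped, while 0xF0 + 0x10 gives 0xFF saturated versus 0x00 wrapped, so it is reported as clipped.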
Another way to compare unsigned bytes: add 0x80 and compare them as signed ones.
__m128i _mm_cmplt_epu8(__m128i a, __m128i b) {
__m128i as = _mm_add_epi8(a, _mm_set1_epi8((char)0x80));
__m128i bs = _mm_add_epi8(b, _mm_set1_epi8((char)0x80));
return _mm_cmplt_epi8(as, bs);
}
I don't think it is very efficient, but it works, and it may be useful in some cases. Also, you can use xor instead of addition if you want.
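The xor variant mentioned above could look roughly like this. Flipping the top bit with xor gives the same bytes as adding 0x80, so the signed compare sees the operands in unsigned order. The names cmplt_epu8_xor and cmplt_u8_16 are hypothetical, and the array wrapper is only there to make the sketch easy to try:

```c
#include <assert.h>
#include <stdint.h>
#include <emmintrin.h> /* SSE2 */

/* Unsigned a < b per byte: flip the top bit of both operands with xor,
   then use the signed compare. */
static inline __m128i cmplt_epu8_xor(__m128i a, __m128i b)
{
    const __m128i top_bit = _mm_set1_epi8((char)0x80);
    return _mm_cmplt_epi8(_mm_xor_si128(a, top_bit), _mm_xor_si128(b, top_bit));
}

/* Hypothetical array wrapper: out[i] = 0xFF where a[i] < b[i], else 0x00. */
static inline void cmplt_u8_16(const uint8_t *a, const uint8_t *b, uint8_t *out)
{
    assert(a && b && out);
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, cmplt_epu8_xor(va, vb));
}
```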
In some cases you can even do bidirectional range checking at once, i.e. compare a value with both lower and upper bounds. To do so, align the lower bound with 0x80, similar to what this answer does.
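Here is one possible reading of that range-check idea as a sketch. It assumes lo <= hi, and the name in_range_u8_16 and its array interface are invented for the example. Each byte is shifted so that lo lands on 0x80 (the smallest signed byte), and then a single signed compare against the shifted upper bound catches values on either side of the range:

```c
#include <assert.h>
#include <stdint.h>
#include <emmintrin.h> /* SSE2 */

/* Hypothetical helper: out[i] = 0xFF where lo <= x[i] <= hi (unsigned),
   else 0x00. Assumes lo <= hi. */
static inline void in_range_u8_16(const uint8_t *x, uint8_t lo, uint8_t hi, uint8_t *out)
{
    assert(x && out && lo <= hi);
    __m128i v = _mm_loadu_si128((const __m128i *)x);
    /* Wrapping add that maps lo to 0x80; values below lo wrap around
       and end up above the shifted upper bound. */
    __m128i shifted = _mm_add_epi8(v, _mm_set1_epi8((char)(uint8_t)(0x80 - lo)));
    /* Where hi lands after the same shift. */
    __m128i limit = _mm_set1_epi8((char)(uint8_t)(0x80 + (hi - lo)));
    /* Signed compare: anything greater than the limit is out of range. */
    __m128i outside = _mm_cmpgt_epi8(shifted, limit);
    __m128i inside = _mm_xor_si128(outside, _mm_set1_epi8(-1));
    _mm_storeu_si128((__m128i *)out, inside);
}
```

If the out-of-range mask is all you need, the final xor can be dropped, which leaves one add and one compare per vector.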
Here is an implementation of the comparisons for 8-bit unsigned integers:
inline __m128i NotEqual8u(__m128i a, __m128i b)
{
return _mm_andnot_si128(_mm_cmpeq_epi8(a, b), _mm_set1_epi8(-1));
}
inline __m128i Greater8u(__m128i a, __m128i b)
{
return _mm_andnot_si128(_mm_cmpeq_epi8(_mm_min_epu8(a, b), a), _mm_set1_epi8(-1));
}
inline __m128i GreaterOrEqual8u(__m128i a, __m128i b)
{
return _mm_cmpeq_epi8(_mm_max_epu8(a, b), a);
}
inline __m128i Lesser8u(__m128i a, __m128i b)
{
return _mm_andnot_si128(_mm_cmpeq_epi8(_mm_max_epu8(a, b), a), _mm_set1_epi8(-1));
}
inline __m128i LesserOrEqual8u(__m128i a, __m128i b)
{
return _mm_cmpeq_epi8(_mm_min_epu8(a, b), a);
}