I've been looking at MMX/SSE and noticed that there are instructions for packed, saturated subtraction of unsigned bytes and words, but not doublewords.
Is there a way to do saturated subtraction of packed unsigned doublewords, or if not, why is there no such instruction?
If you have SSE4.1 available, I don't think you can get better than using the pmaxud+psubd approach suggested by @harold. With AVX2, you can of course also use the corresponding 256-bit variants.
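#include <smmintrin.h> // SSE4.1: _mm_max_epu32

__m128i subs_epu32_sse4(__m128i a, __m128i b){
    __m128i mx = _mm_max_epu32(a,b); // max(a,b) >= b, so the subtraction cannot wrap
    return _mm_sub_epi32(mx, b);     // a-b if a>b, otherwise b-b == 0
}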
Without SSE4.1, you need to compare both arguments in some way. Unfortunately, there is no unsigned 32-bit (epu32) comparison before AVX512, but you can simulate one by first adding 0x80000000 to both arguments (which is equivalent to xor-ing in this case, since only the highest bit is affected) and then using the signed comparison:
#include <emmintrin.h> // SSE2

__m128i cmpgt_epu32(__m128i a, __m128i b) {
    // flipping the sign bit maps unsigned order onto signed order
    const __m128i highest = _mm_set1_epi32((int)0x80000000u);
    return _mm_cmpgt_epi32(_mm_xor_si128(a,highest),_mm_xor_si128(b,highest));
}
__m128i subs_epu32(__m128i a, __m128i b){
    __m128i not_saturated = cmpgt_epu32(a,b); // all ones where a > b
    return _mm_and_si128(not_saturated, _mm_sub_epi32(a,b)); // zero where a <= b
}
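To see why flipping the highest bit turns a signed comparison into an unsigned one, here is a minimal scalar sketch of a single 32-bit lane. The function names are made up for illustration, and the cast from uint32_t to int32_t assumes the usual two's-complement wrap-around that gcc, clang and MSVC provide.

```c
#include <stdint.h>

/* Hypothetical scalar helper: unsigned a > b computed with a signed compare,
   mirroring what pxor + pcmpgtd do for each 32-bit lane. */
static int gt_u32_via_signed(uint32_t a, uint32_t b) {
    /* Flipping bit 31 maps [0, 2^32) onto [-2^31, 2^31) and preserves order.
       Assumption: the unsigned-to-signed cast wraps (two's complement). */
    int32_t sa = (int32_t)(a ^ UINT32_C(0x80000000));
    int32_t sb = (int32_t)(b ^ UINT32_C(0x80000000));
    return sa > sb;
}

/* Scalar reference for saturating unsigned subtraction of one lane. */
static uint32_t subs_u32_ref(uint32_t a, uint32_t b) {
    return gt_u32_via_signed(a, b) ? a - b : 0u;
}
```

A reference like this is also handy for checking the vector versions lane by lane against random inputs.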
In some cases, it might be better to replace the comparison by some bit-twiddling of the highest bit and broadcasting that to every bit using a shift (this replaces a pcmpgtd and three bit-logic operations (and having to load 0x80000000 at least once) by a psrad and five bit-logic operations):
__m128i subs_epu32_(__m128i a, __m128i b) {
    __m128i r = _mm_sub_epi32(a,b);
    __m128i c = (~a & b) | (r & ~(a^b)); // sign bit of c = borrow out (a < b). Works with gcc/clang. Replace by corresponding intrinsics, if necessary (note that `andnot` is a single instruction)
    return _mm_andnot_si128(_mm_srai_epi32(c,31), r); // keep r only where no borrow occurred
}
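The bit-twiddling relies on the borrow out of bit 31: with r = a - b, the top bit of (~a & b) | (r & ~(a ^ b)) is set exactly when a < b as unsigned values, so lanes with the borrow set must be zeroed. A minimal scalar sketch of one lane, with a hypothetical function name:

```c
#include <stdint.h>

/* Hypothetical scalar model of the borrow-based variant (one 32-bit lane). */
static uint32_t subs_u32_borrow(uint32_t a, uint32_t b) {
    uint32_t r = a - b;                      /* wrapping difference */
    uint32_t c = (~a & b) | (r & ~(a ^ b));  /* bit 31 = borrow out, i.e. a < b */
    uint32_t borrow_mask = (uint32_t)0 - (c >> 31); /* all ones if a < b, else 0
                                                       (what psrad by 31 yields) */
    return r & ~borrow_mask;                 /* zero the lane when it underflowed */
}
```

If the operands have different top bits, the borrow is decided by ~a & b alone. If they have equal top bits, the top bit of r equals the borrow coming in from the lower 31 bits, which is then also the borrow out.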
Godbolt-Link, also including adds_epu32 variants: https://godbolt.org/z/n4qaW1
Strangely, clang needs more register copies than gcc for the non-SSE4.1 variants. On the other hand, clang finds the pmaxud optimization for the cmpgt_epu32 variant when compiled with SSE4.1: https://godbolt.org/z/3o5KCm
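For reference, one way an SSE2-only adds_epu32 could look, using the same sign-flip comparison. This is a sketch under my own assumptions and not necessarily what the linked Godbolt code does. The names with a _sketch suffix and the lane0 helper are hypothetical.

```c
#include <emmintrin.h> /* SSE2 */
#include <stdint.h>

/* Unsigned a > b per 32-bit lane via the sign-flip trick (SSE2 only). */
static __m128i cmpgt_epu32_sketch(__m128i a, __m128i b) {
    const __m128i highest = _mm_set1_epi32((int)0x80000000u);
    return _mm_cmpgt_epi32(_mm_xor_si128(a, highest),
                           _mm_xor_si128(b, highest));
}

/* Saturating unsigned add: the wrapped sum is smaller than an input
   exactly when the addition overflowed. */
static __m128i adds_epu32_sketch(__m128i a, __m128i b) {
    __m128i sum = _mm_add_epi32(a, b);
    __m128i overflow = cmpgt_epu32_sketch(a, sum); /* all ones where a > a+b */
    return _mm_or_si128(sum, overflow);            /* force 0xFFFFFFFF there */
}

/* Hypothetical helper for checks: extract lane 0 as an unsigned value. */
static uint32_t lane0(__m128i v) {
    return (uint32_t)_mm_cvtsi128_si32(v);
}
```

With SSE4.1, the comparison could presumably be avoided here as well, analogous to the pmaxud+psubd approach for subtraction.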