I'm looking to understand SSE2's capabilities a little more, and would like to know if one could make a 128-bit wide integer that supports addition, subtraction, XOR and multiplication? Thanks, Erkling.
可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效,请关闭广告屏蔽插件后再试):
问题:
回答1:
SSE2 has no carry flag but you can easily calculate the carry as carry = sum < a
or carry = sum < b
like this. But worse yet, SSE2 doesn't have 64-bit comparisons too, so you must use some workarounds like the one here
Here is an untested, unoptimized C code based on the idea above.
inline bool lessthan(__m128i a, __m128i b){
a = _mm_xor_si128(a, _mm_set1_epi32(0x80000000));
b = _mm_xor_si128(b, _mm_set1_epi32(0x80000000));
__m128i t = _mm_cmplt_epi32(a, b);
__m128i u = _mm_cmpgt_epi32(a, b);
__m128i z = _mm_or_si128(t, _mm_shuffle_epi32(t, 177));
z = _mm_andnot_si128(_mm_shuffle_epi32(u, 245),z);
return _mm_cvtsi128_si32(z) & 1;
}
inline __m128i addi128(__m128i a, __m128i b)
{
__m128i sum = _mm_add_epi64(a, b);
__m128i mask = _mm_set1_epi64(0x8000000000000000);
if (lessthan(_mm_xor_si128(mask, sum), _mm_xor_si128(mask, a)))
{
__m128i ONE = _mm_setr_epi64(0, 1);
sum = _mm_add_epi64(sum, ONE);
}
return sum;
}
As you can see, the code requires many more instructions and even after optimized it may still be much more longer than a simple 2 instruction add/adc in x86_64 (or 4 in x86)