I need to compute a horizontal XOR of two 128-bit integers (each treated as four 32-bit lanes) and combine the two results into one 64-bit integer. In scalar code, the operation looks like this:
uint32_t x0[4];
uint32_t x1[4];
uint32_t xor0 = x0[0];
uint32_t xor1 = x1[0];
for (int i = 1; i < 4; ++i) {
    xor0 ^= x0[i];
    xor1 ^= x1[i];
}
// Named "result" because "xor" is a reserved alternative token in C++.
uint64_t result = uint64_t(xor1) << 32 | xor0;
I eventually found the following code, which seems to work:
__m128i x0 = ...;
__m128i x1 = ...;
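// Gather the high and the low 64-bit halves of both inputs.
__m128i xor64_0 = _mm_unpackhi_epi64(x0, x1);
__m128i xor64_1 = _mm_unpacklo_epi64(x0, x1);
// Each input is now reduced to two 32-bit partial XORs (lanes 0-1: x0, lanes 2-3: x1).
__m128i xor64 = _mm_xor_si128(xor64_0, xor64_1);
// Line up the two partial XORs of each input in the same lane.
__m128i xor32_0 = _mm_shuffle_epi32(xor64, _MM_SHUFFLE(3, 1, 2, 0));
__m128i xor32_1 = _mm_shuffle_epi32(xor64, _MM_SHUFFLE(2, 0, 3, 1));
// Lane 0 holds the full XOR of x0, lane 1 the full XOR of x1.
__m128i xor32 = _mm_xor_si128(xor32_0, xor32_1);
// Named "result" because "xor" is a reserved alternative token in C++.
uint64_t result = _mm_cvtsi128_si64(xor32);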
Is this the fastest possible implementation? Would it make sense to mix integer and floating-point instructions, such as _mm_movehdup_ps(.)?
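For reference, here is a minimal sketch of how the two versions above could be checked against each other. The function names `hxor_scalar`, `hxor_sse2` and `hxor_matches` are hypothetical wrappers introduced only for this check, and the sketch assumes an x86-64 target, since `_mm_cvtsi128_si64` is not available in 32-bit builds.

```cpp
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics; also provides _MM_SHUFFLE

// Scalar reference: XOR the four 32-bit lanes of each input, then pack
// x1's result into the high half and x0's result into the low half.
inline std::uint64_t hxor_scalar(const std::uint32_t x0[4],
                                 const std::uint32_t x1[4]) {
    std::uint32_t r0 = x0[0];
    std::uint32_t r1 = x1[0];
    for (int i = 1; i < 4; ++i) {
        r0 ^= x0[i];
        r1 ^= x1[i];
    }
    return (static_cast<std::uint64_t>(r1) << 32) | r0;
}

// The intrinsic sequence from the question, wrapped in a function.
inline std::uint64_t hxor_sse2(__m128i x0, __m128i x1) {
    __m128i hi  = _mm_unpackhi_epi64(x0, x1);
    __m128i lo  = _mm_unpacklo_epi64(x0, x1);
    __m128i x64 = _mm_xor_si128(hi, lo);
    __m128i a   = _mm_shuffle_epi32(x64, _MM_SHUFFLE(3, 1, 2, 0));
    __m128i b   = _mm_shuffle_epi32(x64, _MM_SHUFFLE(2, 0, 3, 1));
    return static_cast<std::uint64_t>(_mm_cvtsi128_si64(_mm_xor_si128(a, b)));
}

// Hypothetical helper: load both arrays (unaligned) and compare the results.
inline bool hxor_matches(const std::uint32_t x0[4], const std::uint32_t x1[4]) {
    __m128i v0 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(x0));
    __m128i v1 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(x1));
    return hxor_scalar(x0, x1) == hxor_sse2(v0, v1);
}
```

This only checks correctness. It says nothing about which variant is faster, which would have to be measured on the target CPU.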