How do I clear the 16 - i upper bytes of a __m128i?
I've tried this; it works, but I'm wondering if there is a better (shorter, faster) way:
int i = ... // 0 < i < 16
__m128i x = ...
__m128i mask = _mm_set_epi8(
0,
(i > 14) ? -1 : 0,
(i > 13) ? -1 : 0,
(i > 12) ? -1 : 0,
(i > 11) ? -1 : 0,
(i > 10) ? -1 : 0,
(i > 9) ? -1 : 0,
(i > 8) ? -1 : 0,
(i > 7) ? -1 : 0,
(i > 6) ? -1 : 0,
(i > 5) ? -1 : 0,
(i > 4) ? -1 : 0,
(i > 3) ? -1 : 0,
(i > 2) ? -1 : 0,
(i > 1) ? -1 : 0,
-1);
x = _mm_and_si128(x, mask);
I tried a few different ways of implementing this and benchmarked them with a couple of different compilers on an early Core i7 @ 2.67 GHz and a recent Haswell @ 3.6 GHz:
Results were interesting:
Core i7 @ 2.67 GHz, Apple LLVM gcc 4.2.1 (gcc -O3)
Core i7 @ 2.67 GHz, Apple clang 4.2 (clang -Os)
Haswell E3-1285 @ 3.6 GHz, gcc 4.7.2 (gcc -O2)
So mask_shift_4 (switch/case) seems to be the slowest method in all cases, whereas the others are pretty similar. The LUT-based methods seem to be consistently the fastest overall.
NB: I get some suspiciously fast numbers with clang -O3 and gcc -O3 (gcc 4.7.2 only) - I need to look at the generated assembly for these cases to see what the compiler is doing, and make sure it is not doing anything "clever", such as optimising away some part of the timing test harness.
If anyone else has any further ideas on this or has another mask_shift implementation they'd like to try, I would be happy to add it to the test suite and update the results.
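For anyone wondering what a LUT-based mask can look like, here is a minimal sketch of one common variant: a 32-byte table read through a sliding 16-byte window. The names mask_lut and clear_upper_bytes_lut are hypothetical, and this is not necessarily the exact code that was benchmarked above.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* 16 bytes of 0xFF followed by 16 bytes of 0x00 (hypothetical table name). */
static const uint8_t mask_lut[32] = {
    0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
    0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
};

/* Keep the low i bytes of x and clear the upper 16 - i bytes, 0 < i < 16. */
static __m128i clear_upper_bytes_lut(__m128i x, int i)
{
    assert(i > 0 && i < 16);
    /* Starting the 16-byte window at offset 16 - i makes it cover
       i bytes of 0xFF followed by 16 - i bytes of 0x00. */
    __m128i mask = _mm_loadu_si128((const __m128i *)(mask_lut + 16 - i));
    return _mm_and_si128(x, mask);
}
```

The load is unaligned on purpose, since the window start depends on i. The table costs 32 bytes, and the mask is built with one load and no branches.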
If these were ordinary 64-bit values, I'd use something like -
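A minimal sketch of that kind of 64-bit mask, assuming the usual "shift a one, then subtract one" idiom (the helper name keep_low_bytes_u64 is made up for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: keep the low i bytes of a 64-bit value, clear the rest.
   Valid for 0 < i < 8; i == 8 would shift by 64 bits, which is undefined in C. */
static uint64_t keep_low_bytes_u64(uint64_t x, int i)
{
    assert(i > 0 && i < 8);
    uint64_t mask = (UINT64_C(1) << (i * 8)) - 1;  /* low i*8 bits set */
    return x & mask;
}
```

For example, keep_low_bytes_u64(0x1122334455667788, 3) keeps only the three lowest bytes, 0x667788.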
But take care when generalizing this to 128 bits: the built-in shift operators don't necessarily work at that width.
For 128 bits, you could just build separate upper and lower masks, e.g. -
(Assuming I didn't swap the order; check me on this one, I'm not very familiar with these intrinsics.) Alternatively, you can do this on a 2-wide uint64 array and load the 128-bit mask directly from memory using its address.
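A rough sketch of the 2-wide array variant, assuming SSE2 and a little-endian x86 target so that element 0 of the array ends up in the low 64 bits of the register (the helper name low_bytes_mask is hypothetical):

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: 128-bit mask with the low i bytes set, 0 < i < 16. */
static __m128i low_bytes_mask(int i)
{
    assert(i > 0 && i < 16);
    uint64_t halves[2];  /* halves[0] = low 64 bits, halves[1] = high 64 bits */
    if (i < 8) {
        halves[0] = (UINT64_C(1) << (i * 8)) - 1;
        halves[1] = 0;
    } else {
        halves[0] = UINT64_MAX;
        /* For i == 8 this is (1 << 0) - 1 == 0, so the high half stays clear. */
        halves[1] = (UINT64_C(1) << ((i - 8) * 8)) - 1;
    }
    return _mm_loadu_si128((const __m128i *)halves);
}
```

The result can then be applied with x = _mm_and_si128(x, low_bytes_mask(i)). Splitting on i < 8 keeps every 64-bit shift count below 64, which avoids the out-of-range shift mentioned above.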
However, neither of these methods feels as natural as the original one: they just widen the elements from 1 to 8 bytes, but the mask is still built piecewise. It would be much preferable to do a proper shift on a single 128-bit variable.
I just came across this topic regarding 128b shifts -
Looking for sse 128 bit shift operation for non-immediate shift value
It looks like it's possible, but I've never used it. You could try the above one-liner with the appropriate SSE intrinsic from there. I'd give this one a shot -
And then just subtract one using your preferred way (I'd be surprised if this type supports a plain old operator-)