What is the best practice for swapping __m128i variables?
The background is a compile error under Sun Studio 12.2, which is a C++03 compiler. __m128i is an opaque type used with MMX and SSE instructions, and it's usually an unsigned long long[2]. C++03 does not provide support for swapping arrays, and std::swap(__m128i a, __m128i b) fails under the compiler.
Here are some related questions that don't quite hit the mark. They don't apply because std::vector
is not available.
swap via memcpy?

This doesn't sound like a best-practices issue; it sounds like you need a workaround for a seriously broken implementation of intrinsics. If __m128i tmp = a; doesn't compile, that's pretty bad.

If you're going to write a custom swap function, keep it simple.
__m128i is a POD type that fits in a single vector register. Don't do anything that will encourage the compiler to spill it to memory. Some compilers will generate really horrible code even for a trivial test case, and even gcc/clang might trip over a memcpy as part of optimizing a big complicated function.

Since the compiler is choking on the constructor, just declare a tmp variable with a normal initializer, and use = assignment to do the copying. That always works efficiently in any compiler that supports __m128i, and is a common pattern.

Plain assignment to/from values in memory works like
_mm_store_si128 / _mm_load_si128: i.e. movdqa aligned stores/loads that will fault if used on unaligned addresses. (Of course, optimization can result in loads getting folded into memory operands to another vector instruction, or stores not happening at all.)

Test cases: optimal code even with a crusty compiler like ICC13, which does a terrible job with the memcpy version. asm output from the Godbolt compiler explorer, with icc13 -O3.
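For reference, here is a minimal sketch of the assignment-based swap described above. It assumes an SSE2-capable x86 target and the standard <emmintrin.h> header; the names vecswap and vec_equal are hypothetical, chosen only for illustration, and the code sticks to C++03.

```cpp
#include <emmintrin.h>  // SSE2: __m128i and the _mm_* integer intrinsics

// Hypothetical helper: swap two vectors with plain assignment.
// No memcpy and no address-taking tricks, so the compiler is free
// to keep both values in vector registers.
static inline void vecswap(__m128i &a, __m128i &b) {
    __m128i tmp = a;  // ordinary initializer
    a = b;
    b = tmp;
}

// Hypothetical helper for checking results: true if all 16 bytes match.
static inline bool vec_equal(__m128i x, __m128i y) {
    return _mm_movemask_epi8(_mm_cmpeq_epi8(x, y)) == 0xFFFF;
}
```

Calling vecswap(a, b) on two local __m128i variables should typically compile to a few register moves at most, or to nothing at all once inlined, though the exact output depends on the compiler and its options.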
With memswap, you get something like
This is pretty much the absolute maximum amount of spilling/reloading you could imagine to swap two registers, because icc13 doesn't optimize between the inlined memcpys at all, or even remember what is left in a register.

Swapping values already in memory
Even gcc makes worse code with the memcpy version. It does the copy with 64bit integer loads/stores instead of a 128bit vector load/store. This is terrible if you're about to load the vector (store-forwarding stall), and otherwise is just bad (more uops to do the same work).