STL implements a generic std::swap
function to swap 2 values. It can be presented in the following way:
template <class T> void swap (T& a, T& b)
{
T c(std::move(a));
a=std::move(b);
b=std::move(c);
}
However, there is a XOR swap algorithm to swap 2 integers (http://en.wikipedia.org/wiki/XOR_swap_algorithm):
void swap_u( size_t& x, size_t& y )
{
x = x^y;
y = x^y;
x = x^y;
}
My questions:
- Is it an optimization nowadays (on
x86
or arm
)?
- Does C++ standard favor this kind of optimization?
- Are there any real STL implementations in the wild that have
std::swap
specialization for integers?
In the vast majority of situations, XOR swap is not an optimisation.
See this wiki entry.
In most practical scenarios, the trivial swap algorithm using a temporary register is more efficient. Limited situations in which XOR swapping may be practical include:
- On a processor where the instruction set encoding permits the XOR swap to be encoded in a smaller number of bytes;
- In a region with high register pressure, it may allow the register allocator to avoid spilling a register.
- In microcontrollers where available RAM is very limited.
Because these situations are rare, most optimizing compilers do not generate XOR swap code.
Also note that your implementation of XOR swap is broken. You need to first check that x and y aren't aliased. This check will definitely make XOR swap slower.
I'm not aware of any standard library implementation that uses XOR swap.
Note that, regardless of what the standard library implements, if XOR swap were really faster than normal swap then optimizing compilers would do a peephole optimization to turn it into an XOR swap. This really is a case of just letting the compiler choose for you.
XOR swap is really only a gimmick and can fail in certain cases (e.g. both variables are references to the same object).
XOR swap is also not particularly efficient as it has serial dependencies so it will always take at least three instruction cycles. Using a straightforward swap with a temporary has fewer dependencies, allowing for some parallelism on modern superscalar CPUs - on some CPUs it can even be implemented in one instruction, but even without special instructions it may well execute in two cycles.
On X86, a triple XOR swap between memory locations (not CPU registers) takes the same processor cycles as a triple copy. They can be even less if the temporary is a register.