I would like to translate this code using SSE intrinsics.
for (uint32_t i = 0; i < length; i += 4, src += 4, dest += 4)
{
uint32_t value = *(uint32_t*)src;
*(uint32_t*)dest = ((value >> 16) & 0xFFFF) | (value << 16);
}
Is anyone aware of an intrinsic to perform the 16-bit word swapping?
The scalar code in your question isn't really byte swapping (in the sense of endianness conversion, at least) - it's just swapping the high and low 16 bits within a 32-bit word. If this is what you want, though, then just re-use the solution to your previous question, with appropriate changes:
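A minimal sketch of that 2-shifts-and-an-OR approach with SSE2 intrinsics (the function name and the assumption that length is a multiple of 4 dwords are mine, not the linked answer's exact code):

#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>
#include <stddef.h>

/* Sketch: swap the 16-bit halves of each 32-bit element.
   Assumes length (in dwords) is a multiple of 4; unaligned pointers are fine. */
static void wordswap_sse2(uint32_t *dest, const uint32_t *src, size_t length)
{
    for (size_t i = 0; i < length; i += 4)
    {
        __m128i v  = _mm_loadu_si128((const __m128i *)(src + i));
        __m128i hi = _mm_srli_epi32(v, 16);   /* high halves moved down */
        __m128i lo = _mm_slli_epi32(v, 16);   /* low halves moved up    */
        _mm_storeu_si128((__m128i *)(dest + i), _mm_or_si128(hi, lo));
    }
}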
pshufb (SSSE3) should be faster than 2 shifts and an OR. Also, a slight modification to the shuffle mask would enable an endian conversion, instead of just a word-swap.

Stealing Paul R's function structure, just replacing the vector intrinsics:
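A minimal sketch of that pshufb-based version (the shuffle-control constant swaps the 16-bit halves of each dword; the function name and scalar tail loop are my choices, not the answer's exact code):

#include <tmmintrin.h>  /* SSSE3: _mm_shuffle_epi8 */
#include <stdint.h>
#include <stddef.h>

static void wordswap_ssse3(uint32_t *dest, const uint32_t *src, size_t length)
{
    /* Within each dword, bytes {0,1,2,3} -> {2,3,0,1}, i.e. swap the two 16-bit halves.
       Changing this to {3,2,1,0} per dword would give a full endian swap instead. */
    const __m128i mask = _mm_set_epi8(13,12,15,14, 9,8,11,10, 5,4,7,6, 1,0,3,2);

    size_t i = 0;
    for (; i + 4 <= length; i += 4)
    {
        __m128i v = _mm_loadu_si128((const __m128i *)(src + i));
        _mm_storeu_si128((__m128i *)(dest + i), _mm_shuffle_epi8(v, mask));
    }
    for (; i < length; ++i)            /* scalar tail for leftover dwords */
        dest[i] = (src[i] >> 16) | (src[i] << 16);
}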
pshufb can have a memory operand, but it has to be the shuffle mask, not the data to be shuffled. So you can't use it as a shuffled load. :/

gcc doesn't generate great code for the loop: with all the loop overhead, and needing a separate load and store instruction, throughput will only be 1 shuffle per 2 cycles (8 fused-domain uops per iteration, since cmp macro-fuses with jbe).

A faster loop would be one with much less per-iteration overhead.
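A minimal sketch of such a loop, assuming the buffers are indexed from their ends with a negative offset that counts up toward zero (register choices and labels are placeholders, not actual compiler output):

.loop:                                  ; rsi, rdi point just past the end of src/dest
                                        ; rcx = -(byte count), xmm1 = shuffle-control mask
    movdqu  xmm0, [rsi + rcx]           ; 16-byte load (still one uop with indexed addressing)
    pshufb  xmm0, xmm1                  ; swap the 16-bit halves of each dword
    movdqu  [rdi + rcx], xmm0           ; indexed store: store-address + store-data uops
    add     rcx, 16
    jl      .loop                       ; add/jl macro-fuse into one uop

That gives 1 load + 1 shuffle + a 2-uop store + 1 fused add/jl, which is where the 5-uop figure below comes from.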
movdqu loads can micro-fuse with complex addressing modes, unlike vector ALU ops, so all these instructions are single-uop except the store, I believe. This should run at 1 cycle per iteration with some unrolling, since add can macro-fuse with jl. So the loop has 5 total uops, 3 of which are load/store ops, which have dedicated ports. The bottlenecks are:

- pshufb can only run on one execution port on Haswell (SnB/IvB can run pshufb on ports 1 and 5).
- One store per cycle (all microarchitectures).
- The 4-fused-domain-uops-per-clock limit for Intel CPUs, which should be reachable barring cache misses on Nehalem and later (uop loop buffer).

Unrolling would bring the total fused-domain uops per 16B down below 4. Incrementing pointers, instead of using complex addressing modes, would let the stores micro-fuse; a sketch combining both ideas follows. (Reducing loop overhead is always good: letting the re-order buffer fill up with future iterations means the CPU has something to do when it hits a mispredict at the end of the loop and returns to other code.)
This is pretty much what you'd get by unrolling the intrinsics loops, as Elalfer rightly suggests would be a good idea. With gcc, try -funroll-loops if that doesn't bloat the code too much.

BTW, it's probably going to be better to byte-swap while loading or storing, mixed in with other code, rather than converting a buffer as a separate operation.
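For example, a possible invocation (the file name and the flags other than -funroll-loops are just placeholders for whatever your build already uses):

gcc -O3 -mssse3 -funroll-loops -c wordswap.c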