I would like to translate this code using SSE intrinsics.
for (uint32_t i = 0; i < length; i += 4, src += 4, dest += 4)
{
    uint32_t value = *(uint32_t*)src;
    *(uint32_t*)dest = ((value >> 16) & 0xFFFF) | (value << 16);
}
Is anyone aware of an intrinsic to perform the 16-bit word swapping?
pshufb (SSSE3) should be faster than 2 shifts and an OR. Also, a slight modification to the shuffle mask would enable an endian conversion, instead of just a word-swap.
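For example (a sketch only, not part of the original answer), a mask that reverses all 4 bytes of each 32-bit element would be a drop-in replacement for shufmask in the function below:

    // Hedged sketch: reverse all 4 bytes of each dword (endianness conversion)
    // instead of swapping its two 16-bit halves.
    __m128i bswap_mask = _mm_set_epi8(12,13,14,15, 8,9,10,11, 4,5,6,7, 0,1,2,3);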
Stealing Paul R's function structure, just replacing the vector intrinsics:
#include <immintrin.h>   // SSSE3 for _mm_shuffle_epi8
#include <stdint.h>
#include <stddef.h>

void word_swapping_ssse3(uint32_t* dest, const uint32_t* src, size_t count)
{
    size_t i;
    __m128i shufmask = _mm_set_epi8(13,12, 15,14, 9,8, 11,10, 5,4, 7,6, 1,0, 3,2);
    // _mm_set args go in big-endian order for some reason (highest byte first).
    for (i = 0; i + 4 <= count; i += 4)
    {
        __m128i s = _mm_loadu_si128((const __m128i*)&src[i]);
        __m128i d = _mm_shuffle_epi8(s, shufmask);
        _mm_storeu_si128((__m128i*)&dest[i], d);
    }
    for ( ; i < count; ++i) // handle residual elements
    {
        uint32_t w = src[i];
        w = (w >> 16) | (w << 16);
        dest[i] = w;
    }
}
pshufb can have a memory operand, but it has to be the shuffle mask, not the data to be shuffled. So you can't use it as a shuffled load. :/
gcc doesn't generate great code for the loop. The main loop is:
# src: r8. dest: rcx. count: rax. shufmask: xmm1
.L16:
movq %r9, %rax
.L3: # first-iteration entry point
movdqu (%r8), %xmm0
leaq 4(%rax), %r9
addq $16, %r8
addq $16, %rcx
pshufb %xmm1, %xmm0
movups %xmm0, -16(%rcx)
cmpq %rdx, %r9
jbe .L16
With all that loop overhead, and needing a separate load and store instruction, throughput will only be 1 shuffle per 2 cycles: the loop is 8 fused-domain uops (cmp macro-fuses with jbe), and the front-end can only issue 4 per clock, so each iteration costs 2 cycles.
A faster loop would be:
shl $2, %rax # uint count -> byte count
# check for %rax less than 16 and skip the vector loop
# cmp / jsomething
add %rax, %r8 # set up pointers to the end of the array
add %rax, %rcx
neg %rax # and count upwards toward zero
.loop:
movdqu (%r8, %rax), %xmm0
pshufb %xmm1, %xmm0
movups %xmm0, (%rcx, %rax) # IDK why gcc chooses movups for stores. Shorter encoding?
add $16, %rax
jl .loop
# ...
# scalar cleanup
movdqu loads can micro-fuse with complex addressing modes, unlike vector ALU ops, so all of these instructions are single-uop except the store, I believe.
This should run at 1 cycle per iteration with some unrolling, since add can macro-fuse with jl. So the loop is 5 total uops, 3 of which are load/store ops with dedicated ports. The bottlenecks are: pshufb can only run on one execution port on Haswell (SnB/IvB can run pshufb on ports 1 and 5); one store per cycle (all microarches); and finally the 4-fused-domain-uops-per-clock limit for Intel CPUs, which should be reachable barring cache misses on Nehalem and later (thanks to the uop loop buffer).
Unrolling would bring the total fused-domain uops per 16B down below 4. Incrementing pointers, instead of using complex addressing modes, would let the stores micro-fuse. (Reducing loop overhead is always good: letting the re-order buffer fill up with future iterations means the CPU has something to do when it hits a mispredict at the end of the loop and returns to other code.)
This is pretty much what you'd get by unrolling the intrinsics loops, as Elalfer rightly suggests would be a good idea. With gcc, try -funroll-loops if that doesn't bloat the code too much.
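For illustration only (this sketch isn't from the original answer; it assumes the same src, dest, count, and shufmask as in word_swapping_ssse3 above), a hand-unrolled-by-2 intrinsics loop would look something like:

    size_t i;
    // Process 8 uint32_t (two 16-byte vectors) per iteration; the 4-at-a-time
    // loop and the scalar cleanup from the original function still handle the tail.
    for (i = 0; i + 8 <= count; i += 8)
    {
        __m128i s0 = _mm_loadu_si128((const __m128i*)&src[i]);
        __m128i s1 = _mm_loadu_si128((const __m128i*)&src[i + 4]);
        _mm_storeu_si128((__m128i*)&dest[i],     _mm_shuffle_epi8(s0, shufmask));
        _mm_storeu_si128((__m128i*)&dest[i + 4], _mm_shuffle_epi8(s1, shufmask));
    }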
BTW, it's probably going to be better to byte-swap while loading or storing, mixed in with other code, rather than converting a buffer as a separate operation.
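For example (a sketch only, not from either answer), if the swapped values feed straight into other vector work, fold the pshufb into that loop instead of writing out a swapped copy first:

    // Hedged sketch: summing the word-swapped values stands in for "other code".
    // The shuffle happens while the data is already in a register, so no separate
    // swapped buffer or second pass over memory is needed.
    __m128i acc = _mm_setzero_si128();
    for (size_t i = 0; i + 4 <= count; i += 4)
    {
        __m128i s = _mm_loadu_si128((const __m128i*)&src[i]);
        __m128i d = _mm_shuffle_epi8(s, shufmask);  // swap in a register
        acc = _mm_add_epi32(acc, d);                // consume it immediately
    }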
The scalar code in your question isn't really byte swapping (in the sense of endianness conversion, at least) - it's just swapping the high and low 16 bits within a 32-bit word. If this is what you want, though, then just re-use the solution to your previous question, with appropriate changes:
void byte_swapping(uint32_t* dest, const uint32_t* src, size_t count)
{
    size_t i;
    for (i = 0; i + 4 <= count; i += 4)
    {
        __m128i s = _mm_loadu_si128((const __m128i*)&src[i]);
        __m128i d = _mm_or_si128(_mm_slli_epi32(s, 16), _mm_srli_epi32(s, 16));
        _mm_storeu_si128((__m128i*)&dest[i], d);
    }
    for ( ; i < count; ++i) // handle residual elements
    {
        uint32_t w = src[i];
        w = (w >> 16) | (w << 16);
        dest[i] = w;
    }
}
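A minimal usage sketch (not part of either answer; it assumes the functions above are in the same file, and that the pshufb version is built with SSSE3 enabled, e.g. gcc -O3 -mssse3):

    #include <stdio.h>

    int main(void)
    {
        // 6 elements: 4 go through the vector loop, 2 through the scalar cleanup.
        uint32_t in[6]  = { 0x11112222, 0x33334444, 0x55556666,
                            0x77778888, 0x9999AAAA, 0xBBBBCCCC };
        uint32_t out[6];

        byte_swapping(out, in, 6);   /* or word_swapping_ssse3(out, in, 6); */

        for (int i = 0; i < 6; i++)
            printf("%08X -> %08X\n", (unsigned)in[i], (unsigned)out[i]);  /* e.g. 11112222 -> 22221111 */
        return 0;
    }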