I have the following problem which I need to solve using anything other than AVX2.
I have 3 values stored in an __m128i variable (the 4th value is not needed) and need to shift those values by 4, 3 and 5 bits. I need two functions: one for the right logical shift by those counts and another for the left logical shift.
Does anyone know a solution to the problem using SSE/AVX? The only thing I could find was _mm_srlv_epi32(), which is AVX2.
To add a little more information: here is the code I am trying to optimize using SSE/AVX. It's part of my draughts/checkers engine.
uint32_t Board::getMoversBlack(){
    const uint32_t nocc = ~(BP | WP);   // empty squares
    const uint32_t BK = BP & K;         // black kings

    uint32_t movers = (nocc >> 4) & BP;
    movers |= ((nocc & MASK_R3) >> 3) & BP;
    movers |= ((nocc & MASK_R5) >> 5) & BP;

    if (BK != 0) {
        movers |= (nocc << 4) & BK;
        movers |= ((nocc & MASK_L3) << 3) & BK;
        movers |= ((nocc & MASK_L5) << 5) & BK;
    }
    return movers;
}
Would appreciate any help.
If you really need this (and can't avoid it by rearranging your data), you can fully/safely emulate _mm_srlv_epi32 without clobbering any high or low bits.
For compile-time constant counts, you can use a mix of left and right shifts with most of these.
Probably bad options:
Unpack to scalar: yuck. Kinda bad for compile-time constant counts, but even worse for runtime-variable counts, especially if you'd have to unpack a vector of counts. x86 variable-count shifts without BMI2 shrx have clunky semantics and decode to multiple uops on Intel SnB-family. They also take extra mov instructions to put the shift count in cl if it wasn't already there.
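For concreteness, this is roughly what the scalar route looks like for the constant 4,3,5 case (my sketch, function name mine; the extract/rebuild traffic is exactly the cost being criticized above):

#include <emmintrin.h>
#include <stdint.h>

// "Unpack to scalar" for counts 4,3,5 (SSE2 only). Element 3 is unused, matching the question.
__m128i srlv435_scalar(__m128i v)
{
    uint32_t x = (uint32_t)_mm_cvtsi128_si32(v);
    uint32_t y = (uint32_t)_mm_cvtsi128_si32(_mm_shuffle_epi32(v, _MM_SHUFFLE(1,1,1,1)));
    uint32_t z = (uint32_t)_mm_cvtsi128_si32(_mm_shuffle_epi32(v, _MM_SHUFFLE(2,2,2,2)));
    return _mm_setr_epi32((int)(x >> 4), (int)(y >> 3), (int)(z >> 5), 0);
}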
Do separate shifts and then blend to take the element from the vector that was shifted by that amount. This isn't great, but you might be able to reduce the cost of blending by zeroing the unwanted elements as you copy them. (e.g. if the high element is known to be zero, copy with pshufd to get a vector of {0,22,0,0} from a starting vector of {11,22,33,0}, and repeat for {0,0,33,0}.)
So, zero the high element that you're not using, 2x pshufd to copy+shuffle zeros into place, 3x psrld with different counts, AND out the other elements in the vector you didn't copy, then OR the 3 vectors back together. (This takes more work if you aren't leaving one element of your vector unused.)
Depending on the rest of your code, and the microarchitecture, using a shuffle instead of MOVDQA+PAND might not be worth it. If any elements use the same shift count, this option becomes more attractive.
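As a concrete reference point, here is one possible spelling of the shift-then-blend idea using plain AND masks instead of the pshufd copies (my sketch, function name mine):

#include <emmintrin.h>

// 3 shifts + 3 masks + 2 ORs: the straightforward SSE2 shift-and-blend.
__m128i srlv435_blend_sse2(__m128i v)
{
    __m128i a = _mm_and_si128(_mm_srli_epi32(v, 4), _mm_setr_epi32(-1, 0, 0, 0)); // keep x>>4
    __m128i b = _mm_and_si128(_mm_srli_epi32(v, 3), _mm_setr_epi32(0, -1, 0, 0)); // keep y>>3
    __m128i c = _mm_and_si128(_mm_srli_epi32(v, 5), _mm_setr_epi32(0, 0, -1, 0)); // keep z>>5
    return _mm_or_si128(_mm_or_si128(a, b), c);                                   // [ x>>4  y>>3  z>>5  0 ]
}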
Also, you can blend the low element into a vector with movss, and blend the low half with movsd. Those use the shuffle port, so shuffle throughput could be a problem. This might actually be pretty solid.
Hopefully better options:
The SSE2 version of Marc's suggestion (see below) also works in the fully general case.
When the difference between smallest and largest shift count is <= the smallest shift count, you can use @Marc's SSE4.1 suggestion to use multiply as a variable left-shift to account for the differences in right-shift counts. Or on its own as a left-shift. This is probably the best for most cases, taking many fewer instructions even though vector-int multiply is slow.
__m128i srlv435_sse4(__m128i v)
{
    __m128i rshift = _mm_srli_epi32(v, 3);                              // v >> 3
    // Account for the differences in shift count by multiplying by powers of 2
    __m128i vshift = _mm_mullo_epi32(rshift, _mm_setr_epi32(2,4,1,0));  // [ x>>2  y>>1  z>>3  0 ], except with low bits truncated
    __m128i shift2 = _mm_srli_epi32(vshift, 2);                         // [ x>>4  y>>3  z>>5  0 ]
    return shift2;
}
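Since the question also needs the left-shift counterpart: with SSE4.1 the multiply is a variable left shift all by itself, no pre/post shift needed. A minimal sketch (function name mine):

#include <smmintrin.h>

// Left logical shift of the elements by 4, 3, 5: multiplying by 2^n is the same as
// shifting left by n when you only keep the low 32 bits, which is what _mm_mullo_epi32 does.
__m128i sllv435_sse4(__m128i v)
{
    return _mm_mullo_epi32(v, _mm_setr_epi32(1 << 4, 1 << 3, 1 << 5, 0));
}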
This is nice because it operates in-place without the compiler needing any MOVDQA instructions to copy registers, even without AVX1.
Note that SSE4.1 _mm_mullo_epi32 is not fast: 2 uops for p0 on Haswell, 10c latency and one per 2c throughput. Better throughput on Skylake, where each of the 2 uops can run on p0 or p1, but still dependent with 10c latency. (http://agner.org/optimize/ and other links in the x86 tag wiki.)
This has better latency on pre-Haswell, where pmulld is a single-uop instruction (~5 cycles, 1c throughput) instead of 2 dependent uops for 10 cycles.
On AMD Bulldozer-family and Ryzen, latency = 4 or 5, throughput = one per 2c for pmulld.
I didn't check on port conflicts with vector shifts.
Without SSE4.1, you can use 2x SSE2 _mm_mul_epu32 to do 2 multiplies at a time. To line up the odd elements (1 and 3), pshufd to copy+shuffle them down to positions 0 and 2, where pmuludq looks for them.
That produces two 64-bit results from the even two 32-bit elements, so you don't need to pre-shift to avoid overflow. It also means it's safe to use when the difference between shift counts is larger than the minimum shift, i.e. in cases where the SSE4.1 way couldn't keep all the required bits in the element with the most kept bits.
// general case: substitute in *any* shift counts and it still works.
__m128i srlv_sse2(__m128i v)   // [ x  y  z  w ]
{
    __m128i vs_even = _mm_mul_epu32(v, _mm_setr_epi32(1U<<1, 1U<<2, 1U<<0, 0));   // [ x<<1  z<<0 ] (64-bit elements)
    // The 4 (1U<<2) is unused, but this lets us share a constant with the SSE4 version, saving rodata size.
    // (Compilers optimize duplicate constants for you; check the disassembly for the same address.)
    vs_even = _mm_srli_epi64(vs_even, 5);                      // [ x>>4  0  z>>5  0 ] (32-bit elements ready for blending with just an OR)

    __m128i odd = _mm_shuffle_epi32(v, _MM_SHUFFLE(3, 3, 1, 1));
    __m128i vs_odd = _mm_mul_epu32(odd, _mm_setr_epi32(1U<<(32-3), 0, 0, 0));     // [ (y<<32) >> 3  0 ] (64-bit elements)
    // If any elements need left shifts, you can't get them all the way out the top of the high half with a 32-bit power of 2.
    //vs_odd = _mm_slli_epi64(vs_odd, 32 - (3+2));             // [ garbage  y>>3  0  0 ]

    // SSE2 doesn't have blend instructions, do it manually.
    __m128i vs_oddhi = _mm_and_si128(vs_odd, _mm_setr_epi32(0, -1, 0, -1));
    __m128i shifted = _mm_or_si128(vs_even, vs_oddhi);
    return shifted;
}
There are some obvious optimizations here:
Your case isn't using the 4th element, so the 2nd multiply is pointless: just shift and AND away the elements you don't need: vs_odd = _mm_srli_epi32(v, 3); and use 0,-1,0,0 as your AND mask (keeping only y>>3).
Instead of a left-shift by 1 and 0, add x to itself and leave z unmodified. Copying a vector with zeroing of the upper 64 bits is very cheap (movq), but not as cheap as movdqa (on CPUs with mov-elimination):
__m128i rshift = _mm_srli_epi32(v, 3);            // v >> 3
__m128i xy00   = _mm_move_epi64(rshift);          // [ x>>3  y>>3  0  0 ]
__m128i vshift = _mm_add_epi32(rshift, xy00);     // [ x>>2  y>>2  z>>3  0 ]
But this doesn't handle y. We could isolate y>>2 from vshift and add it again to produce y>>1. (But remember not to use the old y>>3 from xy00.)
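Spelled out, that add-based variant could look like this (my sketch, function name mine; same truncated-low-bits caveat as the pmulld version, which is fine because those bits get shifted out anyway):

#include <emmintrin.h>

// SSE2-only: replace the pmulld with adds, for counts 4,3,5 with element 3 unused.
__m128i srlv435_sse2_add(__m128i v)
{
    __m128i rshift = _mm_srli_epi32(v, 3);                                // [ x>>3  y>>3  z>>3  . ]
    __m128i xy00   = _mm_move_epi64(rshift);                              // [ x>>3  y>>3  0     0 ]
    __m128i vshift = _mm_add_epi32(rshift, xy00);                         // [ x>>2  y>>2  z>>3  . ]
    __m128i y_only = _mm_and_si128(vshift, _mm_setr_epi32(0, -1, 0, 0));  // isolate y>>2
    vshift = _mm_add_epi32(vshift, y_only);                               // [ x>>2  y>>1  z>>3  . ]
    return _mm_srli_epi32(vshift, 2);                                     // [ x>>4  y>>3  z>>5  . ]
}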
We could also consider using _mm_mul_epu32 (pmuludq) once, and a copy+shift+AND for the other step (copy from the original v instead of rshift to shorten the dep chain). This is useful in your case because you're not using the top element, so there's only one valid odd element and you thus don't need a variable shift.
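Putting the last two suggestions together (one pmuludq for the even elements, a plain shift + AND for the single odd element), a specialized SSE2 version for the 4,3,5 / unused-top-element case might look like this (my sketch, not part of the general-case code above):

#include <emmintrin.h>

// SSE2-only, specialized for shift counts 4,3,5 with element 3 unused.
__m128i srlv435_sse2(__m128i v)                                           // [ x  y  z  (unused) ]
{
    __m128i vs_even = _mm_mul_epu32(v, _mm_setr_epi32(1U<<1, 0, 1U<<0, 0)); // [ x<<1 , z ] (64-bit elements)
    vs_even = _mm_srli_epi64(vs_even, 5);                                 // [ x>>4  0  z>>5  0 ]
    __m128i vs_odd = _mm_srli_epi32(v, 3);                                // [ x>>3  y>>3  z>>3  . ]
    vs_odd = _mm_and_si128(vs_odd, _mm_setr_epi32(0, -1, 0, 0));          // keep only y>>3
    return _mm_or_si128(vs_even, vs_odd);                                 // [ x>>4  y>>3  z>>5  0 ]
}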
With a combination of movq, movss, and movsd, there might be something more to gain here from basically shifting the 3 elements separately. There are tradeoffs between port pressure, latency, uop count (front-end throughput) and whatnot. e.g. I'm thinking:
movdqa  xmm1, xmm0
psrld   xmm0, 4          # [ x>>4  y>>4  garbage ]
psrld   xmm1, 3          # [ x>>3  y>>3  garbage ]
movss   xmm1, xmm0       # [ x>>4  y>>3  garbage ]   # FP shuffle
psrld   xmm0, 1          # [ garbage     z>>5 ]
movsd   xmm0, xmm1       # [ x>>4  y>>3  z>>5 ]      # FP shuffle
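In intrinsics that sequence could be written roughly as follows (my naming; the casts to float/double are what make the compiler use movss/movsd as the FP blends noted in the comments above):

#include <emmintrin.h>

// movss/movsd-style blending of separately shifted copies.
__m128i srlv435_fpblend(__m128i v)
{
    __m128i s4 = _mm_srli_epi32(v, 4);                              // [ x>>4  y>>4  z>>4  . ]
    __m128i s3 = _mm_srli_epi32(v, 3);                              // [ x>>3  y>>3  z>>3  . ]
    __m128  xy = _mm_move_ss(_mm_castsi128_ps(s3),
                             _mm_castsi128_ps(s4));                 // [ x>>4  y>>3  z>>3  . ]  (movss)
    __m128i s5 = _mm_srli_epi32(s4, 1);                             // [ x>>5  y>>5  z>>5  . ]
    __m128d r  = _mm_move_sd(_mm_castsi128_pd(s5),
                             _mm_castps_pd(xy));                    // [ x>>4  y>>3  z>>5  . ]  (movsd)
    return _mm_castpd_si128(r);
}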
Haswell for example only has 1 shift per clock throughput, so this isn't wonderful. It has pretty good latency though, compared to the multiply options. It's good on Skylake where 2 ports can run vector immediate shifts.
FP shuffles between integer instructions are fine on Intel CPUs other than Nehalem (where there's a 2-cycle bypass-delay latency penalty each way, but throughput is still ok). I think it's fine on AMD as well.
Of course all those CPUs have SSE4.1, so if you use dynamic runtime dispatch the SSE2 version only has to work on Core2 / K10. (And I guess older Atom, or whatever).
code + asm output on Godbolt
The SSE2 Intel Architecture Instruction Set Extensions introduced shift operations for integers. Below is a list of the available compiler intrinsics implementing logical shift operations in SSE2:
psllw    __m128i _mm_sll_epi16 (__m128i a, __m128i count)
pslld    __m128i _mm_sll_epi32 (__m128i a, __m128i count)
psllq    __m128i _mm_sll_epi64 (__m128i a, __m128i count)
psllw    __m128i _mm_slli_epi16 (__m128i a, int imm8)
pslld    __m128i _mm_slli_epi32 (__m128i a, int imm8)
psllq    __m128i _mm_slli_epi64 (__m128i a, int imm8)
pslldq   __m128i _mm_slli_si128 (__m128i a, int imm8)
psrlw    __m128i _mm_srl_epi16 (__m128i a, __m128i count)
psrld    __m128i _mm_srl_epi32 (__m128i a, __m128i count)
psrlq    __m128i _mm_srl_epi64 (__m128i a, __m128i count)
psrlw    __m128i _mm_srli_epi16 (__m128i a, int imm8)
psrld    __m128i _mm_srli_epi32 (__m128i a, int imm8)
psrlq    __m128i _mm_srli_epi64 (__m128i a, int imm8)
psrldq   __m128i _mm_srli_si128 (__m128i a, int imm8)
More information is available on Intel Intrinsics Guide website.
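For illustration, a minimal sketch showing the two forms from the list above (immediate count vs. count passed in the low bits of a vector); function names are mine:

#include <emmintrin.h>

// All four 32-bit elements are shifted by the same immediate count.
__m128i srl_all_by_4(__m128i v)
{
    return _mm_srli_epi32(v, 4);                          // [ a>>4  b>>4  c>>4  d>>4 ]
}

// Runtime-variable count: the count is read from the low 64 bits of the count vector.
__m128i srl_all_by_n(__m128i v, unsigned n)
{
    return _mm_srl_epi32(v, _mm_cvtsi32_si128((int)n));   // every element >> n
}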
If the limitation that the above intrinsics shift all elements by the same number of bits does not suit you, it is possible to use multiplication by a power of 2 (for left shifts) and division by a power of 2 (for right shifts). However, this would have a major impact on performance, and right-shifting the 3 32-bit integers each by a different value in scalar code would most probably be faster than a vector division. The same could be true for multiplication; however, I would test it in code first.