SSE2 shift by vector

Question:

I've been trying to implement shift by vector with SSE2 intrinsics, but from experimentation and the Intel intrinsics guide, it appears that the shift intrinsics use only the least-significant element of the count vector.

To reword my question, given a vector {v1, v2, ..., vn} and a set of shifts {s1, s2, ..., sn}, how do I calculate a result {r1, r2, ..., rn} such that:

r1 = v1 << s1
r2 = v2 << s2
...
rn = vn << sn

since it appears that _mm_sll_epi* performs this:

r1 = v1 << s1
r2 = v2 << s1
...
rn = vn << s1

Thanks in advance.

EDIT:

Here's the code I have:

#include <iostream>

#include <cstdint>

#include <mmintrin.h>
#include <emmintrin.h>

namespace SIMD {

    using namespace std;

    class SSE2 {
    public:
        // flipped operands due to function arguments
        SSE2(uint64_t a, uint64_t b, uint64_t c, uint64_t d) { low = _mm_set_epi64x(b, a); high = _mm_set_epi64x(d, c); }

        uint64_t& operator[](int idx)
        {
            switch (idx) {
            case 0:
                _mm_storel_epi64((__m128i*)result, low);
                return result[0];
            case 1:
                _mm_store_si128((__m128i*)result, low);
                return result[1];
            case 2:
                _mm_storel_epi64((__m128i*)result, high);
                return result[0];
            case 3:
                _mm_store_si128((__m128i*)result, high);
                return result[1];
            }

            /* Out-of-range index: undefined behaviour.  Return a dummy
               reference so this compiles (a uint64_t& cannot be 0). */
            return result[0];
        }

        SSE2& operator<<=(const SSE2& rhs)
        {
            low  = _mm_sll_epi64(low,  rhs.getlow());
            high = _mm_sll_epi64(high, rhs.gethigh());

            return *this;
        }

        void print()
        {
            alignas(16) uint64_t a[2];   // _mm_store_si128 needs 16-byte alignment
            _mm_store_si128((__m128i*)a, low);

            cout << hex;
            cout << a[0] << ' ' << a[1] << ' ';

            _mm_store_si128((__m128i*)a, high);

            cout << a[0] << ' ' << a[1] << ' ';
            cout << dec;
        }

        __m128i getlow() const
        {
            return low;
        }

        __m128i gethigh() const
        {
            return high;
        }
    private:
        __m128i low, high;
        uint64_t result[2];
    };
}

int main()
{
    using namespace std;

    cout << "operator<<= test: vector << vector: ";
    {
        auto x = SIMD::SSE2(7, 8, 15, 10);
        auto y = SIMD::SSE2(4, 5,  6,  7);

        x.print();
        y.print();

        x <<= y;

        if (x[0] != 112 || x[1] != 256 || x[2] != 960 || x[3] != 1280) {
            cout << "FAILED: ";
            x.print();
            cout << endl;
        } else {
            cout << "PASSED" << endl;
        }
    }

    return 0;
}

The expected results are {7 << 4 = 112, 8 << 5 = 256, 15 << 6 = 960, 10 << 7 = 1280}. The actual results seem to be {7 << 4 = 112, 8 << 4 = 128, 15 << 6 = 960, 10 << 6 = 640}, which isn't what I want.

Hope this helps, Jens.

Answer 1:

If AVX2 is available, and your elements are 32 or 64 bits, your operation takes one variable-shift instruction: vpsrlvq (__m128i _mm_srlv_epi64(__m128i a, __m128i count)).
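
For reference, a minimal sketch of that AVX2 path (the function names are mine, not part of the answer). _mm_sllv_epi64 / vpsllvq is the left-shift counterpart, which matches the question's operator<<=:

#include <immintrin.h>

// AVX2 only: one instruction per variable 64-bit shift.  Compile with -mavx2.
// Counts >= 64 produce 0 in that element.
__m128i avx2_sllv_example(__m128i v, __m128i counts)
{
    return _mm_sllv_epi64(v, counts);   // vpsllvq: r[i] = v[i] << counts[i]
}

__m128i avx2_srlv_example(__m128i v, __m128i counts)
{
    return _mm_srlv_epi64(v, counts);   // vpsrlvq: r[i] = v[i] >> counts[i] (logical)
}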


For 32bit elements with SSE4.1, see Shifting 4 integers right by different values SIMD. Depending on latency vs. throughput requirements, you can do separate shifts and then blend, or use a multiply (by a specially-constructed vector of powers of 2) to get variable-count left shifts and then do a same-count-for-all-elements right shift.
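
As an illustration of the multiply idea (not code from the linked answer), a hedged sketch for variable left shifts of 32-bit elements, assuming SSE4.1 for _mm_mullo_epi32 and shift counts in 0..31. It builds 2^count per element by adding the count into the exponent field of 1.0f:

#include <emmintrin.h>
#include <smmintrin.h>

// Sketch: r[i] = v[i] << count[i] for 32-bit elements, via v[i] * 2^count[i].
// 2^count is built by adding count into the exponent field of 1.0f.
// count == 31 still works on x86 because the out-of-range float->int
// conversion yields 0x80000000, which is the right multiplier mod 2^32.
__m128i sse41_sllv_epi32_sketch(__m128i v, __m128i count)
{
    const __m128i one_bits = _mm_castps_si128(_mm_set1_ps(1.0f));      // 0x3F800000 per lane
    __m128i pow2_bits = _mm_add_epi32(_mm_slli_epi32(count, 23), one_bits);
    __m128i pow2      = _mm_cvttps_epi32(_mm_castsi128_ps(pow2_bits)); // 2^count as int32
    return _mm_mullo_epi32(v, pow2);                                   // low 32 bits of the product
}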


For your case, 64bit elements with runtime-variable shift counts:

There are only two elements per SSE vector, so we just need two shifts and then combine the results (which we can do with a pblendw, or with a floating-point movsd (which may cause extra bypass-delay latency on some CPUs), or with two shuffles, or with two ANDs and an OR).

__m128i SSE2_emulated_srlv_epi64(__m128i a, __m128i count)
{
    __m128i shift_low = _mm_srl_epi64(a, count);          // high 64 is garbage
    __m128i count_high = _mm_unpackhi_epi64(count,count); // broadcast the high element
    __m128i shift_high = _mm_srl_epi64(a, count_high);    // low 64 is garbage
    // SSE4.1:
    // return _mm_blend_epi16(shift_low, shift_high, 0xF0);  // take the high 64 bits from shift_high

#if 1   // use movsd to blend
    __m128d blended = _mm_move_sd( _mm_castsi128_pd(shift_high), _mm_castsi128_pd(shift_low) );  // use movsd as a blend.  Faster than multiple instructions on most CPUs, but probably bad on Nehalem.
    return _mm_castpd_si128(blended);
#else  // SSE2 without using FP instructions:
    // if we're going to do it this way, we could have shuffled the input before shifting.  Probably not helpful though.
    shift_high = _mm_unpackhi_epi64(shift_high, shift_high);       // broadcast the valid high64
    return       _mm_unpacklo_epi64(shift_low, shift_high);        // combine: low from shift_low, high from shift_high
#endif
}

Other shuffles like pshufd or psrldq would work, but punpckhqdq gets the job done without needing an immediate byte, so it's one byte shorter. SSSE3 palignr could get the high element from one register and the low element from another register into one vector, but they'd be reversed (so we'd need a pshufd to swap high and low halves). shufpd would work to blend, but has no advantage over movsd.

See Agner Fog's microarch guide for the details of the potential bypass-delay latency from using an FP instruction between two integer instructions. It's probably fine on Intel SnB-family CPUs, because other FP shuffles are. (And yes, movsd xmm1, xmm0 runs on the shuffle unit in port5. Use movaps or movapd for reg-reg moves even of scalars if you don't need the merging behaviour).


This compiles (on Godbolt with gcc5.3 -O3) to

    movdqa  xmm2, xmm0  # tmp97, a
    psrlq   xmm2, xmm1    # tmp97, count
    punpckhqdq      xmm1, xmm1  # tmp99, count
    psrlq   xmm0, xmm1    # tmp100, tmp99
    movsd   xmm0, xmm2    # tmp102, tmp97
    ret
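
Since the question's operator<<= needs left shifts, the same structure works with _mm_sll_epi64; a minimal sketch (the function name is mine), using the movsd-style blend from above:

#include <emmintrin.h>

// Per-element 64-bit left shift with SSE2 only, same pattern as the
// srlv emulation above: two shifts, then a movsd-style blend.
__m128i SSE2_emulated_sllv_epi64(__m128i a, __m128i count)
{
    __m128i shift_low  = _mm_sll_epi64(a, count);               // element 0 is valid
    __m128i count_high = _mm_unpackhi_epi64(count, count);      // broadcast count[1]
    __m128i shift_high = _mm_sll_epi64(a, count_high);          // element 1 is valid
    __m128d blended = _mm_move_sd(_mm_castsi128_pd(shift_high),
                                  _mm_castsi128_pd(shift_low)); // low from shift_low, high from shift_high
    return _mm_castpd_si128(blended);
}

The question's operator<<= could then be written as

        SSE2& operator<<=(const SSE2& rhs)
        {
            low  = SSE2_emulated_sllv_epi64(low,  rhs.getlow());
            high = SSE2_emulated_sllv_epi64(high, rhs.gethigh());
            return *this;
        }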