Optimal SSE unsigned 8 bit compare

2019-03-20 04:39发布

问题:

I'm trying to find the most way of performing 8 bit unsigned compares using SSE (up to SSE 4.2).

The most common case I'm working on is comparing for > 0U, e.g.

_mm_cmpgt_epu8(v, _mm_setzero_si128())                // #1

(which of course can also be considered to be a simple test for non-zero.)

But I'm also somewhat interested in the more general case, e.g.

_mm_cmpgt_epu8(v1, v2)                                // #2

The first case can be implemented with 2 instructions, using various different methods, e.g. compare with 0 and then invert the result. The second case typically requires 3 instructions, e.g. subtract 128 from both operands and perform signed compare. (See this question for various 3 instruction solutions.)

What I'm looking for ideally is a single instruction solution for #1, and a two instruction solution for #2. If neither of these are possible then I'm also interested in thoughts as to which of the various possible 2 or 3 instruction implementations is most efficient on modern Intel CPUs (Sandy Bridge, Ivy Bridge, Haswell).


Best implementations for case #2 so far:

    1. compare equal with unsigned max and invert result:

#define _mm_cmpgt_epu8(v0, v1) \ _mm_andnot_si128(_mm_cmpeq_epi8(_mm_max_epu8(v0, v1), v1), \ _mm_set1_epi8(-1))

Two arithmetic instructions + one bitwise = 1.33 throughput.

    1. invert sign bits for both arguments (== subtract 128) and use signed compare:

#define _mm_cmpgt_epu8(v0, v1) \ _mm_cmpgt_epi8(_mm_xor_si128(v0, _mm_set1_epi8(-128)), \ _mm_xor_si128(v1, _mm_set1_epi8(-128)))

One arithmetic instruction + two bitwise = 1.16 throughput.


Best implementations for case #1, derived from case #2 implementations above:

  • 1.

#define _mm_cmpgtz_epu8(v0) \ _mm_andnot_si128(_mm_cmpeq_epi8(v0, _mm_set1_epi8(0)), \ _mm_set1_epi8(-1))

One arithmetic instruction + one bitwise = 0.83 throughput.

  • 2.

#define _mm_cmpgtz_epu8(v0) \ _mm_cmpgt_epi8(_mm_xor_si128(v0, _mm_set1_epi8(-128)), \ _mm_set1_epi8(-128)))

One arithmetic instruction + one bitwise = 0.83 throughput.

回答1:

There is an example from Simd Library:

    const __m128i K_INV_ZERO = SIMD_MM_SET1_EPI8(0xFF);//_mm_set1_epi8(-1);

    SIMD_INLINE __m128i NotEqual8u(__m128i a, __m128i b)
    {
        return _mm_andnot_si128(_mm_cmpeq_epi8(a, b), K_INV_ZERO);
    }

    SIMD_INLINE __m128i Greater8u(__m128i a, __m128i b)
    {
        return _mm_andnot_si128(_mm_cmpeq_epi8(_mm_min_epu8(a, b), a), K_INV_ZERO);
    }

    SIMD_INLINE __m128i GreaterOrEqual8u(__m128i a, __m128i b)
    {
        return _mm_cmpeq_epi8(_mm_max_epu8(a, b), a);
    }

    SIMD_INLINE __m128i Lesser8u(__m128i a, __m128i b)
    {
        return _mm_andnot_si128(_mm_cmpeq_epi8(_mm_max_epu8(a, b), a), K_INV_ZERO);
    }

    SIMD_INLINE __m128i LesserOrEqual8u(__m128i a, __m128i b)
    {
        return _mm_cmpeq_epi8(_mm_min_epu8(a, b), a);
    }


回答2:

In the spirit of copying code from SIMD libraries here is how Agner Fog's Vector Class Library (C++) does it:

// vector operator >= : returns true for elements for which a >= b (unsigned)
static inline Vec16cb operator >= (Vec16uc const & a, Vec16uc const & b) {
#ifdef __XOP__  // AMD XOP instruction set
    return _mm_comge_epu8(a,b);
#else  // SSE2 instruction set
    return _mm_cmpeq_epi8(_mm_max_epu8(a,b),a); // a == max(a,b)
#endif
}

// vector operator <= : returns true for elements for which a <= b (unsigned)
static inline Vec16cb operator <= (Vec16uc const & a, Vec16uc const & b) {
    return b >= a;
}

// vector operator > : returns true for elements for which a > b (unsigned)
static inline Vec16cb operator > (Vec16uc const & a, Vec16uc const & b) {
#ifdef __XOP__  // AMD XOP instruction set
    return _mm_comgt_epu8(a,b);
#else  // SSE2 instruction set
    return Vec16cb(Vec16c(~(b >= a)));
#endif
}

// vector operator < : returns true for elements for which a < b (unsigned)
static inline Vec16cb operator < (Vec16uc const & a, Vec16uc const & b) {
    return b > a;
}

// vector operator ~ : bitwise not
static inline Vec16uc operator ~ (Vec16uc const & a) {
    return Vec16uc( ~ Vec128b(a));
}

where bitwise not is defined as

// vector operator ~ : bitwise not
static inline Vec128b operator ~ (Vec128b const & a) {
    return _mm_xor_si128(a, _mm_set1_epi32(-1));
}


回答3:

I had an idea for doing >= in two instructions:

  • subtract with unsigned saturation
  • compare with zero

It doesn't help for >, though.

Also, it's pretty much equivalent to ErmIg's SIMD Library answer (max_epu8(a,b) -> cmpeq with a), but worse because it needs a zeroed register. This works for SSE2, instead of SSE4.1, though. psubusb runs on the same ports as pminusb.


A previous version of this answer had the mistaken idea that b-a has the sign bit set iff a>b. But it's actually one bit to the left of that which needs testing: the carry flag / bit (which doesn't exist for packed-integer SIMD).

See the edit history for some ideas about broadcasting the sign bit to the rest of the element with pshufb (negated result) or pblendvb (which may be single-uop on Skylake for the non-VEX version only).



标签: c x86 sse simd sse4