AVX2 what is the most efficient way to pack left b

2018-12-31 18:58发布

问题:

If you have an input array, and an output array, but you only want to write those elements which pass a certain condition, what would be the most efficient way to do this in AVX2?

I\'ve seen in SSE where it was done like this: (From:https://deplinenoise.files.wordpress.com/2015/03/gdc2015_afredriksson_simd.pdf)

__m128i LeftPack_SSSE3(__m128 mask, __m128 val)
{
 // Move 4 sign bits of mask to 4-bit integer value.
 int mask = _mm_movemask_ps(mask);
 // Select shuffle control data
 __m128i shuf_ctrl = _mm_load_si128(&shufmasks[mask]);
 // Permute to move valid values to front of SIMD register
 __m128i packed = _mm_shuffle_epi8(_mm_castps_si128(val), shuf_ctrl);
 return packed;
}

This seems fine for SSE which is 4 wide, and thus only needs a 16 entry LUT, but for AVX which is 8 wide, the LUT becomes quite large(256 entries, each 32 bytes, or 8k).

I\'m surprised that AVX doesn\'t appear to have an instruction for simplifying this process, such as a masked store with packing.

I think with some bit shuffling to count the # of sign bits set to the left you could generate the necessary permutation table, and then call _mm256_permutevar8x32_ps. But this is also quite a few instructions I think..

Does anyone know of any tricks to do this with AVX2? Or what is the most efficient method?

Here is an illustration of the Left Packing Problem from the above document:

\"Left.Packing.Problem\"

Thanks

回答1:

AVX2 + BMI2. See my other answer for AVX512. (Update: saved a pdep in 64bit builds.)

We can use AVX2 vpermps (_mm256_permutevar8x32_ps) (or the integer equivalent, vpermd) to do a lane-crossing variable-shuffle.

We can generate masks on the fly, since BMI2 pext (Parallel Bits Extract) provides us with a bitwise version of the operation we need.


For integer vectors with 32-bit or wider elements: Either 1) _mm256_movemask_ps(_mm256_castsi256_ps(compare_mask)).
Or 2) use _mm256_movemask_epi8 and then change the first PDEP constant from 0x0101010101010101 to 0x0F0F0F0F0F0F0F0F to scatter blocks of 4 contiguous bits. Change the multiply by 0xFFU into expanded_mask |= expanded_mask<<4; or expanded_mask *= 0x11; (Not tested). Either way, use the shuffle mask with VPERMD instead of VPERMPS.

For 64-bit integer or double elements, everything still Just Works; The compare-mask just happens to always have pairs of 32-bit elements that are the same, so the resulting shuffle puts both halves of each 64-bit element in the right place. (So you still use VPERMPS or VPERMD, because VPERMPD and VPERMQ are only available with immediate control operands.)


The algorithm:

Start with a constant of packed 3 bit indices, with each position holding its own index. i.e. [ 7 6 5 4 3 2 1 0 ] where each element is 3 bits wide. 0b111\'110\'101\'...\'010\'001\'000.

Use pext to extract the indices we want into a contiguous sequence at the bottom of an integer register. e.g. if we want indices 0 and 2, our control-mask for pext should be 0b000\'...\'111\'000\'111. pext will grab the 010 and 000 index groups that line up with the 1 bits in the selector. The selected groups are packed into the low bits of the output, so the output will be 0b000\'...\'010\'000. (i.e. [ ... 2 0 ])

See the commented code for how to generate the 0b111000111 input for pext from the input vector mask.

Now we\'re in the same boat as the compressed-LUT: unpack up to 8 packed indices.

By the time you put all the pieces together, there are three total pext/pdeps. I worked backwards from what I wanted, so it\'s probably easiest to understand it in that direction, too. (i.e. start with the shuffle line, and work backward from there.)

We can simplify the unpacking if we work with indices one per byte instead of in packed 3-bit groups. Since we have 8 indices, this is only possible with 64bit code.

See this and a 32bit-only version on the Godbolt Compiler Explorer. I used #ifdefs so it compiles optimally with -m64 or -m32. gcc wastes some instructions, but clang makes really nice code.

#include <stdint.h>
#include <immintrin.h>

// Uses 64bit pdep / pext to save a step in unpacking.
__m256 compress256(__m256 src, unsigned int mask /* from movmskps */)
{
  uint64_t expanded_mask = _pdep_u64(mask, 0x0101010101010101);  // unpack each bit to a byte
  expanded_mask *= 0xFF;    // mask |= mask<<1 | mask<<2 | ... | mask<<7;
  // ABC... -> AAAAAAAABBBBBBBBCCCCCCCC...: replicate each bit to fill its byte

  const uint64_t identity_indices = 0x0706050403020100;    // the identity shuffle for vpermps, packed to one index per byte
  uint64_t wanted_indices = _pext_u64(identity_indices, expanded_mask);

  __m128i bytevec = _mm_cvtsi64_si128(wanted_indices);
  __m256i shufmask = _mm256_cvtepu8_epi32(bytevec);

  return _mm256_permutevar8x32_ps(src, shufmask);
}

This compiles to code with no loads from memory, only immediate constants. (See the godbolt link for this and the 32bit version).

    # clang 3.7.1 -std=gnu++14 -O3 -march=haswell
    mov     eax, edi                   # just to zero extend: goes away when inlining
    movabs  rcx, 72340172838076673     # The constants are hoisted after inlining into a loop
    pdep    rax, rax, rcx              # ABC       -> 0000000A0000000B....
    imul    rax, rax, 255              # 0000000A0000000B.. -> AAAAAAAABBBBBBBB..
    movabs  rcx, 506097522914230528
    pext    rax, rcx, rax
    vmovq   xmm1, rax
    vpmovzxbd       ymm1, xmm1         # 3c latency since this is lane-crossing
    vpermps ymm0, ymm1, ymm0
    ret

So, according to Agner Fog\'s numbers, this is 6 uops (not counting the constants, or the zero-extending mov that disappears when inlined). On Intel Haswell, it\'s 16c latency (1 for vmovq, 3 for each pdep/imul/pext / vpmovzx / vpermps). There\'s no instruction-level parallelism. In a loop where this isn\'t part of a loop-carried dependency, though, (like the one I included in the Godbolt link), the bottleneck is hopefully just throughput, keeping multiple iterations of this in flight at once.

This can maybe manage a throughput of one per 3 cycles, bottlenecked on port1 for pdep/pext/imul. Of course, with loads/stores and loop overhead (including the compare, movmsk, and popcnt), total uop throughput can easily be an issue. (e.g. the filter loop in my godbolt link is 14 uops with clang, with -fno-unroll-loops to make it easier to read. It might sustain one iteration per 4c, keeping up with the front-end, if we\'re lucky, but I think clang failed to account for popcnt\'s false dependency on its output, so it will bottleneck on 3/5ths of the latency of the compress256 function.)

gcc does the multiply by 0xFF with multiple instructions, using a left shift by 8 and a sub. This takes an extra mov instructions, but the end result is a multiply with a latency of 2. (Haswell handles mov at register-rename stage with zero latency.)


Since all hardware that supports AVX2 also supports BMI2, there\'s probably no point providing a version for AVX2 without BMI2.

If you need to do this in a very long loop, the LUT is probably worth it if the initial cache-misses are amortized over enough iterations with the lower overhead of just unpacking the LUT entry. You still need to movmskps, so you can popcnt the mask and use it as a LUT index, but you save a pdep/imul/pexp.

You can unpack LUT entries with the same integer sequence I used, but @Froglegs\'s set1() / vpsrlvd / vpand is probably better when the LUT entry starts in memory and doesn\'t need to go into integer registers in the first place. (A 32bit broadcast-load doesn\'t need an ALU uop on Intel CPUs). However, a variable-shift is 3 uops on Haswell (but only 1 on Skylake).



回答2:

See my other answer for AVX2+BMI2 with no LUT.

Since you mention a concern about scalability to AVX512: don\'t worry, there\'s an AVX512F instruction for exactly this:

VCOMPRESSPS — Store Sparse Packed Single-Precision Floating-Point Values into Dense Memory. (There are also versions for double, and 32 or 64bit integer elements (vpcompressq), but not byte or word (16bit)). It\'s like BMI2 pdep / pext, but for vectors instead of bits in an integer reg.

The destination can be a vector register or a memory operand, while the source is a vector and a mask register. With a register dest, it can merge or zero the upper bits. With a memory dest, \"Only the contiguous vector is written to the destination memory location\".

To figure out how far to advance your pointer for the next vector, popcnt the mask.

Let\'s say you want to filter out everything but values >= 0 from an array:

#include <stdint.h>
#include <immintrin.h>
size_t filter_non_negative(float *__restrict__ dst, const float *__restrict__ src, size_t len) {
    const float *endp = src+len;
    float *dst_start = dst;
    do {
        __m512      sv  = _mm512_loadu_ps(src);
        __mmask16 keep = _mm512_cmp_ps_mask(sv, _mm512_setzero_ps(), _CMP_GE_OQ);  // true for src >= 0.0, false for unordered and src < 0.0
        _mm512_mask_compressstoreu_ps(dst, keep, sv);   // clang is missing this intrinsic, which can\'t be emulated with a separate store

        src += 16;
        dst += _mm_popcnt_u64(keep);   // popcnt_u64 instead of u32 helps gcc avoid a wasted movsx, but is potentially slower on some CPUs
    } while (src < endp);
    return dst - dst_start;
}

This compiles (with gcc4.9 or later) to (Godbolt Compiler Explorer):

 # Output from gcc6.1, with -O3 -march=haswell -mavx512f.  Same with other gcc versions
    lea     rcx, [rsi+rdx*4]             # endp
    mov     rax, rdi
    vpxord  zmm1, zmm1, zmm1             # vpxor  xmm1, xmm1,xmm1 would save a byte, using VEX instead of EVEX
.L2:
    vmovups zmm0, ZMMWORD PTR [rsi]
    add     rsi, 64
    vcmpps  k1, zmm0, zmm1, 29           # AVX512 compares have mask regs as a destination
    kmovw   edx, k1                      # There are some insns to add/or/and mask regs, but not popcnt
    movzx   edx, dx                      # gcc is dumb and doesn\'t know that kmovw already zero-extends to fill the destination.
    vcompressps     ZMMWORD PTR [rax]{k1}, zmm0
    popcnt  rdx, rdx
    ## movsx   rdx, edx         # with _popcnt_u32, gcc is dumb.  No casting can get gcc to do anything but sign-extend.  You\'d expect (unsigned) would mov to zero-extend, but no.
    lea     rax, [rax+rdx*4]             # dst += ...
    cmp     rcx, rsi
    ja      .L2

    sub     rax, rdi
    sar     rax, 2                       # address math -> element count
    ret


回答3:

I came up with this method, which uses a compressed LUT, which is 768(+1 padding) bytes, instead of 8k. It requires a broadcast of a single scalar value, which is then shifted by a different amount in each lane, then masked to the lower 3 bits, which provides a 0-7 LUT.

Here is the intrinsics version, along with code to build LUT.

//Generate Move mask via: _mm256_movemask_ps(_mm256_castsi256_ps(mask)); etc
__m256i MoveMaskToIndices(int moveMask) {
    u8 *adr = g_pack_left_table_u8x3 + moveMask * 3;
    __m256i indices = _mm256_set1_epi32(*reinterpret_cast<u32*>(adr));//lower 24 bits has our LUT

    __m256i m = _mm256_sllv_epi32(indices, _mm256_setr_epi32(29, 26, 23, 20, 17, 14, 11, 8));

    //now shift it right to get 3 bits at bottom
    __m256i shufmask = _mm256_srli_epi32(m, 29);
    return shufmask;
}

u32 get_nth_bits(int a) {
    u32 out = 0;
    int c = 0;
    for (int i = 0; i < 8; ++i) {
        auto set = (a >> i) & 1;
        if (set) {
            out |= (i << (c * 3));
            c++;
        }
    }
    return out;
}
u8 g_pack_left_table_u8x3[256 * 3 + 1];

void BuildPackMask() {
    for (int i = 0; i < 256; ++i) {
        *reinterpret_cast<u32*>(&g_pack_left_table_u8x3[i * 3]) = get_nth_bits(i);
    }
}

Here is the assembly generated by VS2015:

lea eax, DWORD PTR [rcx+rcx*2]
movsxd  rcx, eax
lea rax, OFFSET FLAT:?g_pack_left_table_u8x3@@3PAEA ; g_pack_left_table_u8x3
vpbroadcastd ymm0, DWORD PTR [rcx+rax]
vpsllvd ymm0, ymm0, YMMWORD PTR __ymm@000000080000000b0000000e0000001100000014000000170000001a0000001d
vpsrld  ymm0, ymm0, 29


回答4:

In case anyone is interested here is a solution for SSE2 which uses an instruction LUT instead of a data LUT aka a jump table. With AVX this would need 256 cases though.

Each time you call LeftPack_SSE2 below it uses essentially three instructions: jmp, shufps, jmp. Five of the sixteen cases don\'t need to modify the vector.

static inline __m128 LeftPack_SSE2(__m128 val, int mask)  {
  switch(mask) {
  case  0:
  case  1: return val;
  case  2: return _mm_shuffle_ps(val,val,0x01);
  case  3: return val;
  case  4: return _mm_shuffle_ps(val,val,0x02);
  case  5: return _mm_shuffle_ps(val,val,0x08);
  case  6: return _mm_shuffle_ps(val,val,0x09);
  case  7: return val;
  case  8: return _mm_shuffle_ps(val,val,0x03);
  case  9: return _mm_shuffle_ps(val,val,0x0c);
  case 10: return _mm_shuffle_ps(val,val,0x0d);
  case 11: return _mm_shuffle_ps(val,val,0x34);
  case 12: return _mm_shuffle_ps(val,val,0x0e);
  case 13: return _mm_shuffle_ps(val,val,0x38);
  case 14: return _mm_shuffle_ps(val,val,0x39);
  case 15: return val;
  }
}

__m128 foo(__m128 val, __m128 maskv) {
  int mask = _mm_movemask_ps(maskv);
  return LeftPack_SSE2(val, mask);
}