可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

I have a use case, where I have array of bits each bit is represented as 8 bit integer for example uint8_t data[] = {0,1,0,1,0,1,0,1}; I want to create a single integer by extracting only lsb of each value. I know that using int _mm_movemask_pi8 (__m64 a) function I can create a mask but this intrinsic only takes a msb of a byte not lsb. Is there a similar intrinsic or efficient method to extract lsb to create single 8 bit integer?

回答1:

There is no direct way to do it, but obviously you can simply shift the lsb into the msb and then extract it:

_mm_movemask_pi8(_mm_slli_si64(x, 7))

Using MMX these days is strange and should probably be avoided.

Here is an SSE2 version, still reading only 8 bytes:

int lsb_mask8(uint8_t* bits) {
    __m128i x = _mm_loadl_epi64((__m128i*)bits);
    return _mm_movemask_epi8(_mm_slli_epi64(x, 7));
}

Using SSE2 instead of MMX avoids the needs for EMMS

回答2:

If you have efficient BMI2 pext (e.g. Haswell and newer, same as AVX2), then use the inverse of @wim's answer on your question about going the other direction (How to efficiently convert an 8-bit bitmap to array of 0/1 integers with x86 SIMD).

unsigned extract8LSB(uint8_t *arr) {
    uint64_t bytes;
    memcpy(&bytes, arr, 8);
    unsigned LSBs = _pext_u64(bytes ,0x0101010101010101);
    return LSBs;
}

This compiles like you'd expect to a qword load + a pext instruction. Compilers will hoist the 0x01... constant setup out of a loop after inlining.

pext / pdep are efficient on Intel CPUs that support them (3 cycle latency / 1c throughput, 1 uop, same as a multiply). But they're not efficient on AMD, like 18c latency and throughput. (https://agner.org/optimize/). If you care about AMD, you should definitely use @harold's pmovmskb answer.

Or if you have multiple contiguous blocks of 8 bytes, do them with a single wide vector, and get a 32-bit bitmap. You can split that up if needed, or unroll the loop using by 4, to right-shift the bitmap to get all 4 single-byte results.

If you're just storing this to memory right away, then you should probably have done this extraction in the loop that wrote the source data, instead of a separate loop, so it would still be hot in cache. AVX2 _mm256_movemask_epi8 is a single uop (on Intel CPUs) with low latency, so if your data isn't hot in L1d cache then a loop that just does this would not be keeping its execution units busy while waiting for memory.