The movemask instruction(s) take an __m256i and return an int32 where each bit (either the first 4, 8 or all 32 bits depending on the input vector element type) is the most significant bit of the corresponding vector element.
I would like to do the inverse: take an int32 (where only the 4, 8 or 32 least significant bits are meaningful), and get a __m256i where the most significant bit of each int8, int32 or int64 sized block is set to the original bit.
Basically, I want to go from a compressed bitmask to one that is usable as a mask by other AVX2 instructions (such as maskstore, maskload, mask_gather).
I couldn't quickly find an instruction that does it, so I am asking here. If there isn't one instruction with that functionality, is there a clever hack you can think of that achieves this in very few instructions?
My current method is to use a 256 element lookup table. I want to use this operation within a loop where not much else is happening, to speed it up. Note, I'm not too interested in long multi-instruction sequences or little loops that implement this operation.
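For concreteness, my current method looks roughly like this (shown for an 8-bit mask expanded to 8x 32-bit elements; the names are just for illustration):

```c
#include <immintrin.h>
#include <stdint.h>

/* 256-entry LUT, filled once: entry m has dword j = -1 iff bit j of m is set. */
static int32_t lut[256][8];

static void init_lut(void)
{
    for (int m = 0; m < 256; m++)
        for (int j = 0; j < 8; j++)
            lut[m][j] = (m >> j) & 1 ? -1 : 0;
}

/* The operation I'm asking about: 8-bit mask -> 8x 32-bit vector mask. */
static inline __m256i expand_mask(unsigned m)
{
    return _mm256_loadu_si256((const __m256i *)lut[m & 0xff]);
}
```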
There is no single instruction in AVX2 or earlier.
vpbroadcastw / vpand / vpcmpeqw: How to perform the inverse of _mm256_movemask_epi8 (VPMOVMSKB)? Also Fastest way to unpack 32 bits to a 32 byte SIMD vector.
If you're loading the bitmap from memory, loading it straight into vector registers for an ALU strategy should work well.
If you have the bitmap as a computation result, then it will be in an integer register where you can use it as a LUT index easily, so that's a good choice if you're aiming for 64-bit elements. Otherwise probably still go ALU for 32-bit elements or smaller, instead of a giant LUT or doing multiple chunks.
We'll have to wait for AVX-512's mask registers before cheap conversion from integer bitmasks to vector masks is possible. (With kmovw k1, r/m16, which compilers generate implicitly for int => __mmask16.) There's an AVX512 insn to set a vector from a mask (VPMOVM2D zmm1, k1, _mm512_movm_epi8/16/32/64, with other versions for different element sizes), but you generally don't need it since everything that used to use mask vectors now uses mask registers. Maybe if you want to count elements that meet some comparison condition? (Where you'd use pcmpeqd / psubd to generate and accumulate the vector of 0 or -1 elements.) But scalar popcnt on the mask results would be a better bet.

For 64-bit elements, the mask only has 4 bits, so a lookup table is reasonable. You can compress the LUT by loading it with VPMOVSXBQ ymm1, xmm2/m32 (_mm256_cvtepi8_epi64). This gives you a LUT size of (1<<4) = 16 entries * 4 bytes = 64B = 1 cache line. Unfortunately, pmovsx is inconvenient to use as a narrow load with intrinsics.

Especially if you already have your bitmap in an integer register (instead of memory), a vpmovsxbq LUT should be excellent inside an inner loop for 64-bit elements. Or if instruction throughput or shuffle throughput is a bottleneck, use an uncompressed LUT. This can let you (or the compiler) use the mask vector as a memory operand for something else, instead of needing a separate instruction to load it.
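Untested sketch of the compressed-LUT version in intrinsics (table layout and helper name are mine, nothing standard); the memcpy + _mm_cvtsi32_si128 dance is exactly the pmovsx-as-narrow-load awkwardness mentioned above:

```c
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

/* Compressed LUT: one 4-byte entry per 4-bit mask, one byte (0 or -1) per
   future 64-bit element.  16 * 4 bytes = 64B = one cache line. */
static const int8_t lut64[16][4] = {
    { 0, 0, 0, 0}, {-1, 0, 0, 0}, { 0,-1, 0, 0}, {-1,-1, 0, 0},
    { 0, 0,-1, 0}, {-1, 0,-1, 0}, { 0,-1,-1, 0}, {-1,-1,-1, 0},
    { 0, 0, 0,-1}, {-1, 0, 0,-1}, { 0,-1, 0,-1}, {-1,-1, 0,-1},
    { 0, 0,-1,-1}, {-1, 0,-1,-1}, { 0,-1,-1,-1}, {-1,-1,-1,-1},
};

/* Expand a 4-bit mask to 4x 64-bit 0 / -1 elements (vpmovsxbq from a 4-byte entry). */
static inline __m256i mask4_to_epi64(unsigned m)
{
    int32_t entry;
    memcpy(&entry, lut64[m & 0xf], sizeof(entry));   /* the awkward narrow load */
    return _mm256_cvtepi8_epi64(_mm_cvtsi32_si128(entry));
}
```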
LUT for 32-bit elements: probably not optimal but here's how you could do it

With 32-bit elements, an 8-bit mask gives you 256 possible vectors, each 8 elements long. 256 * 8B = 2048 bytes, which is a pretty big cache footprint even for the compressed version (load with vpmovsxbd ymm, m64).

To work around this, you can split the LUT into 4-bit chunks. It takes about 3 integer instructions to split up an 8-bit integer into two 4-bit integers (mov/and/shr). Then with an uncompressed LUT of 128b vectors (for 32-bit element size), vmovdqa the low half and vinserti128 the high half. You could still compress the LUT, but I wouldn't recommend it because you'll need vmovd / vpinsrd / vpmovsxbd, which is 2 shuffles (so you probably bottleneck on uop throughput).

Or 2x vpmovsxbd xmm, [lut + rsi*4] + vinserti128 is probably even worse on Intel.
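Untested sketch of the recommended uncompressed split-LUT variant (vmovdqa a 128b half per 4-bit chunk, then vinserti128); the table and helper names are just for illustration:

```c
#include <immintrin.h>
#include <stdalign.h>
#include <stdint.h>

/* Half-LUT: entry i has dword j = -1 iff bit j of i is set.
   16 entries * 16 bytes = 256 bytes, vs. 2KB+ for a full 256-entry LUT. */
#define ROW(i) { -((i) & 1), -(((i) >> 1) & 1), -(((i) >> 2) & 1), -(((i) >> 3) & 1) }
alignas(16) static const int32_t lut32_half[16][4] = {
    ROW(0),  ROW(1),  ROW(2),  ROW(3),  ROW(4),  ROW(5),  ROW(6),  ROW(7),
    ROW(8),  ROW(9),  ROW(10), ROW(11), ROW(12), ROW(13), ROW(14), ROW(15)
};

/* Expand an 8-bit mask to 8x 32-bit 0 / -1 elements via two 4-bit lookups. */
static inline __m256i mask8_to_epi32_lut(unsigned m)
{
    __m128i lo = _mm_load_si128((const __m128i *)lut32_half[m & 0xf]);        // vmovdqa low half
    __m128i hi = _mm_load_si128((const __m128i *)lut32_half[(m >> 4) & 0xf]);
    return _mm256_inserti128_si256(_mm256_castsi128_si256(lo), hi, 1);        // vinserti128 high half
}
```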
ALU alternative: good for 16/32/64-bit elements

When the whole bitmap fits in each element, broadcast it, AND with a selector mask, and VPCMPEQ against the same constant (which can stay in a register across multiple uses of this in a loop).
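For example, for 32-bit elements (untested; the helper name is just for illustration):

```c
#include <immintrin.h>

/* Expand an 8-bit mask to 8x 32-bit 0 / -1 elements with pure ALU ops. */
static inline __m256i mask8_to_epi32_alu(unsigned m)
{
    /* One selector bit per element; can stay in a register across a loop. */
    const __m256i bits = _mm256_setr_epi32(1 << 0, 1 << 1, 1 << 2, 1 << 3,
                                           1 << 4, 1 << 5, 1 << 6, 1 << 7);
    __m256i v = _mm256_set1_epi32((int)m);   // vpbroadcastd (or a broadcast-load)
    v = _mm256_and_si256(v, bits);           // vpand: keep only this element's bit
    return _mm256_cmpeq_epi32(v, bits);      // vpcmpeqd: (m & bit) == bit ? -1 : 0
}
```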
(The mask could come from an integer register with vmovd + vpbroadcastd, but a broadcast-load is even cheaper if the mask is already in memory.)
For 8-bit elements, you will need to vpshufb the vpbroadcastd result to get the relevant bit into each byte (sketch below). See How to perform the inverse of _mm256_movemask_epi8 (VPMOVMSKB)?. But for 16-bit and wider elements, the number of elements is <= the element width, so a broadcast-load does this for free. (16-bit broadcast loads do cost a micro-fused ALU shuffle uop, unlike 32 and 64-bit broadcast loads which are handled entirely in the load ports.)

vpbroadcastd/q doesn't even cost any ALU uops; it's done right in the load port. (b and w are load+shuffle.) Even if your masks are packed together (one per byte for 32 or 64-bit elements), it might still be more efficient to vpbroadcastd instead of vpbroadcastb. The x & mask == mask check doesn't care about garbage in the high bytes of each element after the broadcast. The only worry is cache-line / page splits.
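For the 8-bit case mentioned above, the vpshufb variant looks roughly like this (essentially the technique from the linked Q&A; untested, names illustrative):

```c
#include <immintrin.h>
#include <stdint.h>

/* Expand a 32-bit mask to 32x 8-bit 0 / -1 bytes. */
static inline __m256i mask32_to_epi8(uint32_t m)
{
    __m256i v = _mm256_set1_epi32((int32_t)m);                  // vpbroadcastd
    /* vpshufb: replicate mask byte k across the k-th group of 8 bytes
       (the shuffle is per 128-bit lane, which the broadcast already handles). */
    const __m256i shuf = _mm256_setr_epi64x(0x0000000000000000LL,
                                            0x0101010101010101LL,
                                            0x0202020202020202LL,
                                            0x0303030303030303LL);
    v = _mm256_shuffle_epi8(v, shuf);
    /* Isolate each byte's own bit, then compare-for-equal to get 0 / -1. */
    const __m256i bits = _mm256_set1_epi64x((long long)0x8040201008040201ULL);
    v = _mm256_and_si256(v, bits);
    return _mm256_cmpeq_epi8(v, bits);
}
```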
Variable shift (cheaper on Skylake) if you need just the sign bit

Variable blends and masked loads/stores only care about the sign bit of the mask elements. The variable shift (vpsllvd) that moves each mask bit up into the sign bit of its element is only 1 uop (on Skylake) once you have the 8-bit mask broadcast to dword elements.
vpbroadcastd is as cheap as a load from memory (no ALU uop at all on Intel CPUs and Ryzen). (Narrower broadcasts, like vpbroadcastb y,mem, take an ALU shuffle uop on Intel, but maybe not on Ryzen.)

The variable-shift is slightly expensive on Haswell/Broadwell (3 uops, limited execution ports), but as cheap as immediate-count shifts on Skylake! (1 uop on port 0 or 1.) On Ryzen they're also only 2 uops (the minimum for any 256b operation), but have 3c latency and one per 4c throughput.
See the x86 tag wiki for perf info, especially Agner Fog's insn tables.
For 64-bit elements, note that arithmetic right shifts are only available in 16 and 32-bit element size. Use a different strategy if you want the whole element set to all-zero / all-one for 4 bits -> 64-bit elements.
With intrinsics:
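An untested sketch for 32-bit elements (the commented vpsrad line shows how to get full 0 / -1 elements if you need more than the sign bit; the helper name is just for illustration):

```c
#include <immintrin.h>

/* Put each mask bit into the sign bit of its 32-bit element.
   Enough for vblendvps / vmaskmovps / vpmaskmovd etc. */
static inline __m256i mask8_to_signbits_epi32(unsigned m)
{
    __m256i v = _mm256_set1_epi32((int)m);                                   // vpbroadcastd
    const __m256i counts = _mm256_setr_epi32(31, 30, 29, 28, 27, 26, 25, 24);
    v = _mm256_sllv_epi32(v, counts);        // vpsllvd: bit i -> bit 31 of element i
    /* If you need whole 0 / -1 elements, follow with
       v = _mm256_srai_epi32(v, 31);         // vpsrad: broadcast the sign bit */
    return v;
}
```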
Inside a loop, a LUT might be worth the cache footprint, depending on the instruction mix in the loop. Especially for 64-bit element size where it's not much cache footprint, but possibly even for 32-bit.
Another option, instead of variable shift, is to use BMI2 to unpack each bit to a byte with that mask element in the high bit, then vpmovsx (sketch below):
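An untested sketch; note that sign-extending the 0x80 bytes gives elements of 0 or 0xFFFFFF80, i.e. just the sign bit set, which is all the masked ops above look at. Names are illustrative:

```c
#include <immintrin.h>
#include <stdint.h>

/* Requires BMI2 and a 64-bit target; pdep is very slow on Ryzen (see below). */
static inline __m256i mask8_to_epi32_pdep(uint32_t m)
{
    /* Deposit mask bit i into bit 7 of byte i: each byte becomes 0x00 or 0x80. */
    uint64_t spread = _pdep_u64(m, 0x8080808080808080ULL);
    __m128i bytes = _mm_cvtsi64_si128((long long)spread);
    /* vpmovsxbd: 0x80 sign-extends to 0xFFFFFF80 (sign bit set, not all-ones),
       which is sufficient for blendv / maskload / maskstore. */
    return _mm256_cvtepi8_epi32(bytes);
}
```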
If you already have masks in an integer register (where you'd have to vmovq / vpbroadcastd separately anyway), then this way is probably better even on Skylake where variable-count shifts are cheap.

If your masks start in memory, the other ALU method (vpbroadcastd directly into a vector) is probably better, because broadcast-loads are so cheap.

Note that pdep is 6 dependent uops on Ryzen (18c latency, 18c throughput), so this method is horrible on Ryzen even if your masks do start in integer regs.

(Future readers, feel free to edit in an intrinsics version of this. It's easier to write asm because it's a lot less typing, and the asm mnemonics are easier to read (no stupid _mm256_ clutter all over the place).)