Intel's intrinsics guide lists a number of intrinsics for the AVX-512 K* mask instructions, but there seem to be a few missing:
- KSHIFT{L/R}
- KADD
- KTEST
The Intel developer manual claims that intrinsics are not necessary as they are auto-generated by the compiler. How does one do this though? If it means that __mmask* types can be treated as regular integers, it would make a lot of sense, but testing something like mask << 4 seems to cause the compiler to move the mask to a regular register, shift it, then move it back to a mask. This was tested using Godbolt's latest GCC and ICC with -O2 -mavx512bw.
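For example, a minimal repro of what I tested (the function name is just for illustration):

    #include <immintrin.h>

    // On the compilers I tested, the shift is not done with kshiftlw;
    // the mask is bounced through a GPR instead (kmovw / shl / kmovw).
    __m512i shift_test(__m512i a, __m512i b)
    {
        __mmask16 m = _mm512_cmpeq_epi32_mask(a, b);  // mask produced in a k register
        m = m << 4;                                   // done via a GPR round-trip
        return _mm512_maskz_mov_epi32(m, a);          // mask consumed from a k register
    }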
Also interesting to note is that the intrinsics only deal with __mmask16 and not the other types. I haven't tested much, but it looks like ICC doesn't mind taking an incorrect type, while GCC does seem to try to ensure that there are only 16 bits in the mask if you use the intrinsics.
Am I simply overlooking the correct intrinsics for the above instructions, as well as for the other __mmask* type variants, or is there another way to achieve the same thing without resorting to inline assembly?
Intel's documentation, which says the intrinsics are "not necessary as they are auto generated by the compiler," is in fact correct. And yet, it's unsatisfying.
But to understand why things are the way they are, you need to look at the history of AVX512. While none of this information is official, it's strongly implied by the evidence.

The mask intrinsics probably got into the mess they are in now because AVX512 was "rolled out" in multiple phases without sufficient forward planning for the next phase.
Phase 1: Knights Landing
Knights Landing added 512-bit registers which only have 32-bit and 64-bit data granularity. Therefore the mask registers never needed to be wider than 16 bits (512 bits / 32-bit lanes = 16 mask bits).
When Intel was designing this first set of AVX512 intrinsics, they went ahead and added intrinsics for almost everything - including the mask registers. This is why the mask intrinsics that do exist are only 16 bits, and why they only cover the instructions that existed in Knights Landing. (Though I can't explain why KSHIFT is missing.)

On Knights Landing, mask operations were fast (2 cycles). But moving data between mask registers and general-purpose registers was really slow (5 cycles). So it mattered where the mask operations were done, and it made sense to give the user finer-grained control over moving things back and forth between mask registers and GPRs.
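For reference, a minimal sketch of that original intrinsic set in use (the function name is mine; the intrinsics themselves are from the Knights Landing era):

    #include <immintrin.h>

    // Everything is typed __mmask16 - there are no 8/32/64-bit variants
    // in this original set.
    __mmask16 combine(__mmask16 a, __mmask16 b)
    {
        __mmask16 x = _mm512_kand(a, b);   // KANDW
        __mmask16 y = _mm512_kxnor(a, b);  // KXNORW
        return _mm512_kor(x, y);           // KORW
    }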
Phase 2: Skylake Purley
Skylake Purley extends AVX512 to cover byte-granular lanes, which increased the width of the mask registers to the full 64 bits (512 bits / 8-bit lanes = 64 mask bits). This second round also added KADD and KTEST, which didn't exist in Knights Landing.

These new mask instructions (KADD, KTEST, and the 64-bit extensions of the existing ones) are the ones that are missing their intrinsic counterparts. While we don't know exactly why they are missing, there is some strong evidence in support of an explanation:
Compiler/Syntax:
On Knights Landing, the same mask intrinsics were used for both 8-bit and 16-bit masks. There was no way to distinguish between them. Extending them to 32-bit and 64-bit would have made the mess worse. In other words, Intel didn't design the mask intrinsics correctly to begin with, and they decided to drop them completely rather than fix them.
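A sketch of the ambiguity (function names mine): an 8-bit mask silently promotes to the __mmask16 parameters, so there is no way to ask for a byte-width operation.

    #include <immintrin.h>

    // Both calls go through the same __mmask16 intrinsic - there is no way
    // to say "this is logically an 8-bit mask operation" (which would want
    // a byte-width instruction like KANDB).
    __mmask8  and8 (__mmask8 a,  __mmask8 b)  { return (__mmask8)_mm512_kand(a, b); }
    __mmask16 and16(__mmask16 a, __mmask16 b) { return _mm512_kand(a, b); }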
Performance Inconsistencies:
Bit-crossing mask instructions on Skylake Purley are slow. While all the bit-wise mask instructions are single-cycle, KADD, KSHIFT, KUNPACK, etc. are all 4 cycles. But moving between a mask register and a GPR is only 2 cycles.

Because of this, it's often faster to move masks into GPRs, operate on them there, and move them back. But the programmer is unlikely to know this. So rather than giving the user full control of the mask registers, Intel opted to just have the compiler make this decision.
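As a sketch of what that trade-off looks like in practice (assuming __mmask64 is a plain integer typedef, as it is in GCC's and Clang's headers; the function name is mine):

    #include <immintrin.h>
    #include <stdint.h>

    // For a chain of mask operations, the two kmovq transfers (2 cycles
    // each) are amortized: each shift/add costs 1 cycle in a GPR instead
    // of the 4 cycles KSHIFT/KADD take in a mask register.
    __mmask64 shift_add(__mmask64 a, __mmask64 b)
    {
        uint64_t x = (uint64_t)a;   // kmovq: mask -> GPR
        uint64_t y = (uint64_t)b;
        x = (x << 4) + y;           // shl + add instead of KSHIFTLQ + KADDQ
        return (__mmask64)x;        // kmovq: GPR -> mask
    }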
Making the compiler make this decision means the compiler needs to have such logic. The Intel Compiler currently does: it will generate kadd and family in certain (rare) cases. But GCC does not. On GCC, all but the most trivial mask operations will be moved to GPRs and done there instead.

Final Thoughts:
Prior to the release of Skylake Purley, I personally had a lot of AVX512 code written up, including a lot of AVX512 mask code. It was written with certain performance assumptions (single-cycle latency) that turned out to be false on Skylake Purley.
From my own testing on Skylake X, some of my mask-intrinsic code that relied on bit-crossing operations turned out to be slower than the compiler-generated versions that moved them to GPRs and back. The reason, of course, is that KADD and KSHIFT were 4 cycles instead of 1.

Of course, I'd prefer it if Intel provided the intrinsics and gave us the control that I want. But it's very easy to go wrong here (in terms of performance) if you don't know what you're doing.
Update:
It's unclear when this happened, but the latest version of the Intel Intrinsics Guide has a new set of mask intrinsics with a new naming convention that covers all the instructions and widths. These new intrinsics supersede the old ones.
So this solves the entire problem. Though the extent of compiler support is still uncertain.
Examples:
- _kadd_mask64()
- _kshiftri_mask32()
- _cvtmask16_u32(), which supersedes _mm512_mask2int()
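A quick sketch of the new names in use (assuming a compiler recent enough to provide them; the function names are mine):

    #include <immintrin.h>

    // The new intrinsics are explicit about width - no more squeezing
    // everything through __mmask16.
    __mmask64 demo(__mmask64 a, __mmask64 b)
    {
        __mmask64 sum = _kadd_mask64(a, b);   // KADDQ
        return _kshiftri_mask64(sum, 4);      // KSHIFTRQ
    }

    unsigned int to_int(__mmask16 m)
    {
        return _cvtmask16_u32(m);             // replaces _mm512_mask2int()
    }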