Is it my imagination, or is a PNOT
instruction missing from SSE and AVX? That is, an instruction which flips every bit in the vector.
If so, is there a better way to emulate it than PXOR
with a vector of all ones? It's quite annoying that I have to set up an all-ones vector just to use that approach.
For cases such as this it can be instructive to see what a compiler would generate.
E.g. for the following function:
#include <immintrin.h>

__m256i test(const __m256i v)
{
    return ~v;
}
both gcc and clang seem to generate much the same code:
test(long long __vector(4)):
vpcmpeqd ymm1, ymm1, ymm1
vpxor ymm0, ymm0, ymm1
ret
If you use intrinsics, you can wrap the NOT operation in an inline function like this:
inline __m256i _mm256_not_si256(__m256i a)
{
    //return _mm256_xor_si256(a, _mm256_set1_epi32(0xffffffff));
    return _mm256_xor_si256(a, _mm256_cmpeq_epi32(a, a)); // I didn't check which one is faster
}
You can use the PANDN
instruction for that.
PANDN
implements the operation
DEST = NOT(DEST) AND SRC ; (SSEx)
or
DEST = NOT(SRC1) AND SRC2 ; (AVXx)
Combining this operation with an all-ones vector effectively results in a PNOT operation.
Some x86(SSEx) assembly code would look like this:
; XMM0 is input register
PCMPEQB xmm1, xmm1 ; Whole xmm1 reg set to 1's
PANDN xmm0, xmm1 ; xmm0 = NOT(xmm0) AND xmm1
; XMM0 contains NOT(XMM0)
Some x86(AVXx) assembly code would look like this:
; YMM0 is input register
VPCMPEQB ymm1, ymm1, ymm1 ; Whole ymm1 reg set to 1's
VPANDN ymm0, ymm0, ymm1 ; ymm0 = NOT(ymm0) AND ymm1
; YMM0 contains NOT(YMM0)
Both can (of course) easily be translated to intrinsics.
AVX512F vpternlogd
/ _mm512_ternarylogic_epi32(__m512i a, __m512i b, __m512i c, int imm8)
finally provides a way to implement NOT without any extra constants, using a single instruction (which has 2 per clock throughput on Skylake-avx512 and KNL, so it's not quite as good as PXOR / XORPS for 256b and smaller vectors.)
vpternlogd zmm,zmm,zmm, imm8
has 3 input vectors and one output, modifying the destination in place. With the right immediate, you can still implement a copy-and-NOT into a different register, but it will have a "false" dependency on the output register (which vpxord dst, src, all-ones
wouldn't have).
TL;DR: probably still use XOR with all-ones as part of a loop, unless you're running out of registers. vpternlog
may cost an extra vmovdqa
register-copy instruction if its input is needed later. Outside of loops, vpternlogd zmm,zmm,zmm, 0xff
is the compiler's best option for creating a 512b vector of all-ones in the first place, because AVX512 compare instructions compare into masks (k0-k7
), so XOR with all-ones might already involve a vpternlogd
, or maybe a broadcast-constant from memory.
For each bit position i
, the output bit is imm[ (DEST[i]<<2) + (SRC1[i]<<1) + SRC2[i]]
, where the imm8
is treated as an 8-element bitmap.
Thus, if we want the result to depend only on SRC2 (which is the zmm/m512/m32bcst
operand), we should choose a bitmap of repeating 1,0: a 1 at each even position (the positions selected by SRC2 = 0).
vpternlogd zmm1, zmm2, zmm2, 01010101b    ; 0x55
If you're lucky, a compiler will optimize _mm512_xor_epi32(v, set1(-1))
to vpternlogd
for you if it's profitable.
// To hand-hold a compiler into saving a vmovdqa32 if needed:
__m512i tmp = something earlier;
__m512i t2 = _mm...(tmp);
// use-case: tmp is dead, t2 and ~t2 are both needed.
__m512i t2_inv = _mm512_ternarylogic_epi32(tmp, t2, t2, 0b01010101);
If you're not sure that's a good idea, just keep it simple and use the same variable for all 3 inputs:
__m512i t2_inv = _mm512_ternarylogic_epi32(t2, t2, t2, 0b01010101);