As you know, the first two are AVX-specific intrinsics and the third is an SSE4.1 intrinsic. Both sets of intrinsics can be used to check for equality of two floating-point vectors. My specific use case is:

`_mm_cmpeq_ps` or `_mm_cmpeq_pd`, followed by `_mm_testc_ps` or `_mm_testc_pd` on the result, with an appropriate mask.

But AVX provides equivalents for "legacy" intrinsics, so I might be able to use `_mm_testc_si128`, after a cast of the result to `__m128i`. My questions are: which of the two use cases results in better performance, and where can I find out which legacy SSE instructions are provided by AVX?
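To make the two variants concrete, here is a minimal sketch of what is described above. The helper names (`low3_mask`, `eq3_testc_ps`, `eq3_testc_si128`) and the choice of "only the low three lanes matter" as the mask are assumptions made for illustration, and the `target("avx")` attribute assumes GCC or Clang.

```c
#include <immintrin.h>

/* Hypothetical helpers for illustration only. The target attribute
   (GCC/Clang) enables AVX for these functions without a global -mavx. */

/* Assumed mask: all-ones in lanes 0..2, zero in lane 3. */
__attribute__((target("avx")))
static __m128i low3_mask(void) {
    /* _mm_set_epi32 takes its arguments from the highest lane down. */
    return _mm_set_epi32(0, -1, -1, -1);
}

/* Variant 1: cmpeqps, then vtestps (AVX).
   _mm_testc_ps returns 1 when every sign bit set in the mask is also
   set in the compare result, i.e. lanes 0..2 all compared equal. */
__attribute__((target("avx")))
static int eq3_testc_ps(__m128 a, __m128 b) {
    __m128 cmp  = _mm_cmpeq_ps(a, b);
    __m128 mask = _mm_castsi128_ps(low3_mask());
    return _mm_testc_ps(cmp, mask);
}

/* Variant 2: cmpeqps, cast to __m128i, then ptest (SSE4.1).
   _mm_testc_si128 returns 1 when (~cmp & mask) is zero over all 128 bits. */
__attribute__((target("avx")))
static int eq3_testc_si128(__m128 a, __m128 b) {
    __m128 cmp = _mm_cmpeq_ps(a, b);
    return _mm_testc_si128(_mm_castps_si128(cmp), low3_mask());
}
```

Both helpers return 1 when the masked lanes compare equal and 0 otherwise; they differ only in whether the test looks at sign bits or at every bit of the compare result.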
Oops, I didn't read the question carefully. You're talking about using these after a `cmpeqps`. They're always slower than `movmskps` / `test` if you already have a mask. `cmpps` / `ptest` / `jcc` is 4 uops. `cmpps` / `movmskps eax, xmm0` / `test eax,eax` / `jnz` is 3 uops (`test` / `jnz` fuse into a single uop). Also, none of the instructions in that sequence is multi-uop, so there are no decode bottlenecks.

Only use `ptest` / `vtestps` / `vtestpd` when you can take full advantage of the AND or ANDN operation to avoid an earlier step. I've posted answers before where I compared `ptest` vs. an alternative. I think I did find one case once where `ptest` was a win, but it's hard to use. Yup, found it: someone wanted an FP compare that was true for NaN == NaN. It's one of the only times I've ever found a use for the carry flag result of `ptest`.

If the high element of a compare result is "garbage", then you can still ignore it cheaply with `movmskps`, by masking its integer result before testing it. This is totally free. The x86 `test` instruction works a lot like `ptest`: you can use it with an immediate mask instead of testing a register against itself. (It actually has a tiny cost: one extra byte of machine code, because `test eax, 3` is one byte longer than `test eax, eax`, but they run identically.)

See the x86 wiki for links to guides (Agner Fog's guide is good for perf analysis at the instruction level). There's an AVX version of every legacy SSE instruction, but some are only 128 bits wide. They all get an extra operand (so the dest doesn't have to be one of the src regs), which saves on `mov` instructions to copy registers.

Answer to a question you didn't ask:
Neither `_mm_testc_ps` nor `_mm_testc_si128` can be used to compare floats for equality. `vtestps` is like `ptest`, but only operates on the sign bits of each float element.

They all compute `(~x) & y` (on sign bits or on the full register), which doesn't tell you whether they're equal, or even whether the sign bits are equal.

Note that even checking for bitwise equality of floats (with `pcmpeqd`) isn't the same as `cmpeqps` (which implements C's `==` operator), because `-0.0` isn't bitwise equal to `0.0`. And two bitwise-identical NaNs aren't equal to each other. The comparison is unordered (which means not equal) if either or both operands are NaN.
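As a rough illustration of the points above, here is a minimal sketch using only SSE/SSE2 intrinsics. The helper names (`fp_eq_bits`, `low3_equal`, `bitwise_eq_bits`) and the choice to ignore only the highest lane are assumptions made for this example, not something taken from the question.

```c
#include <emmintrin.h>  /* SSE2: also pulls in the SSE float intrinsics */
#include <math.h>       /* NAN */

/* cmpeqps + movmskps: bit i of the result is set if lane i compared
   equal under floating-point rules (like C's ==). */
static int fp_eq_bits(__m128 a, __m128 b) {
    return _mm_movemask_ps(_mm_cmpeq_ps(a, b));
}

/* Ignore a "garbage" high lane by masking the integer result:
   returns 1 if lanes 0..2 all compared equal. */
static int low3_equal(__m128 a, __m128 b) {
    return (fp_eq_bits(a, b) & 0x7) == 0x7;
}

/* pcmpeqd on the raw bit patterns: bit i is set if lane i is
   bitwise identical. This is not the same as FP equality. */
static int bitwise_eq_bits(__m128 a, __m128 b) {
    __m128i eq = _mm_cmpeq_epi32(_mm_castps_si128(a), _mm_castps_si128(b));
    return _mm_movemask_ps(_mm_castsi128_ps(eq));
}
```

With lanes 0..3 set to `(0.0, -0.0, NaN, 1.0)` in one vector and `(0.0, 0.0, NaN, 1.0)` in the other, `fp_eq_bits` gives `0xB` (the NaN lane is unordered, the signed-zero lane is equal), while `bitwise_eq_bits` gives `0xD` (the NaN lane matches bit for bit, the signed-zero lane does not).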