I need to check that all vector elements are non-zero. So far I have found the following solution. Is there a better way to do this? I am using gcc 4.8.2 on Linux/x86_64, with instructions up to SSE4.2 available.
typedef char ChrVect __attribute__((vector_size(16), aligned(16)));
inline bool testNonzero(ChrVect vect)
{
    const ChrVect vzero = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};
    return (0 == (__int128_t)(vzero == vect));
}
Update: the code above compiles to the following assembly (when compiled as a non-inline function):
movdqa %xmm0, -24(%rsp)
pxor %xmm0, %xmm0
pcmpeqb -24(%rsp), %xmm0
movdqa %xmm0, -24(%rsp)
movq -24(%rsp), %rax
orq -16(%rsp), %rax
sete %al
ret
With straight SSE intrinsics you might do it like this:
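#include <stdint.h>
#include <immintrin.h> // SSE2 intrinsics, plus SSE4.1 when enabled

inline bool testNonzero(__m128i v)
{
    // Each byte of vcmp is 0xFF where the matching byte of v is zero, else 0x00
    __m128i vcmp = _mm_cmpeq_epi8(v, _mm_setzero_si128());
#if defined(__SSE4_1__) // for SSE 4.1 and later use PTEST
    return _mm_testz_si128(vcmp, vcmp) != 0; // 1 when vcmp is all zero
#else // for older SSE use PMOVMSKB
    uint32_t mask = _mm_movemask_epi8(vcmp); // one bit per byte of vcmp
    return (mask == 0);
#endif
}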
I suggest looking at what your compiler currently generates for your existing code, then comparing it with this version using intrinsics to see if there is any significant difference.
With SSE3 (clang -O3 -msse3) I get the following for the above function:
pxor %xmm1, %xmm1
pcmpeqb %xmm1, %xmm0
pmovmskb %xmm0, %ecx
testl %ecx, %ecx
The SSE4 version (clang -O3 -msse4.1) produces:
pxor %xmm1, %xmm1
pcmpeqb %xmm1, %xmm0
ptest %xmm0, %xmm0
Note that the zeroing of xmm1 will typically be hoisted out of any loop containing this function, so the above sequences should be reduced by one instruction when used inside a loop.
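To illustrate the loop case, here is a minimal sketch that applies the same compare-against-zero test to a whole buffer. The function name allBytesNonzero, the unaligned loads, and the scalar tail loop are my own assumptions for illustration and are not part of the question. It uses only the SSE2 PMOVMSKB path, so it builds without -msse4.1, and the zero vector is created once before the loop, which is what a compiler would normally do for you after inlining.

```cpp
#include <cstddef>
#include <cstdint>
#include <emmintrin.h> // SSE2: _mm_loadu_si128, _mm_cmpeq_epi8, _mm_movemask_epi8

// Hypothetical helper (name and signature are assumptions for illustration):
// returns true if every byte in buf[0..len) is non-zero.
static inline bool allBytesNonzero(const uint8_t* buf, size_t len)
{
    // Zero vector built once, outside the loop (the hoisting mentioned above)
    const __m128i vzero = _mm_setzero_si128();

    size_t i = 0;
    // Process 16 bytes per iteration; unaligned loads so any pointer works
    for (; i + 16 <= len; i += 16) {
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(buf + i));
        __m128i vcmp = _mm_cmpeq_epi8(v, vzero); // 0xFF where a byte is zero
        if (_mm_movemask_epi8(vcmp) != 0)        // any bit set means a zero byte
            return false;
    }
    // Scalar tail for the remaining 0..15 bytes
    for (; i < len; ++i) {
        if (buf[i] == 0)
            return false;
    }
    return true;
}
```

Whether this beats a plain scalar loop depends on buffer sizes and on how early a zero byte tends to appear, so it is worth measuring on your own data.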