Question:

I am trying to find the first instance of a character, in this case '"', using SIMD (AVX2 or earlier). I'd like to use _mm256_cmpeq_epi8, but then I need a quick way of finding whether any of the resulting bytes in the __m256i have been set to 0xFF. The plan was then to use _mm256_movemask_epi8 to convert the result from bytes to bits, and then to use ffs to get a matching index. Is it better to move out a portion at a time using _mm_movemask_epi8? Any other suggestions?
Answer 1:
You have the right idea with _mm256_cmpeq_epi8 -> _mm256_movemask_epi8. AFAIK, that's the optimal way to implement this, for Intel CPUs at least. PMOVMSKB r32, ymm is the same speed as the XMM 16-byte version, so it would be a huge loss to unpack the two lanes of a 256b vector, movemask them separately, and then recombine the integer results. (Source: Agner Fog's instruction tables. See other perf links in the x86 tag wiki.)
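A minimal sketch of that building block (my own illustration, not from the question: assumes at least 32 readable bytes at p, a GCC/Clang-style compiler, and chunk_mask is just an illustrative name):

```c
#include <immintrin.h>

/* Sketch: test one 32-byte chunk for '"'.  Returns a 32-bit mask with bit i
   set if p[i] == '"'.  Assumes at least 32 readable bytes at p. */
static inline unsigned chunk_mask(const char *p)
{
    __m256i needle = _mm256_set1_epi8('"');
    __m256i chunk  = _mm256_loadu_si256((const __m256i *)p);
    __m256i eq     = _mm256_cmpeq_epi8(chunk, needle);   /* 0xFF where equal */
    return (unsigned)_mm256_movemask_epi8(eq);           /* one bit per byte */
}
```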
Make the code inside the loop as efficient as possible by leaving the ffs until after you've identified a non-zero result from _mm256_movemask_epi8.
TEST/JCC can macro-fuse into a single uop, but BSF/JCC doesn't, so it takes an extra instruction. (And you'd be hard-pressed to get a C compiler to emit BSF/JCC anyway. More likely, branching on the result of ffs would give you some kind of test for the input being non-zero, then BSF, then add 1, then compare-and-branch. That's obviously horrible compared to just testing the movemask result.)
Also note that for similar problems, comparing the movemask (e.g. to check that it's 0xFFFFFFFF) is just as efficient as branching on it being non-zero.
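Putting that together, a hedged sketch of the loop structure described above (find_quote is a hypothetical name; __builtin_ctz is the GCC/Clang equivalent of ffs(mask) - 1 and only runs once the mask is known to be non-zero; a real version also needs tail handling when len isn't a multiple of 32):

```c
#include <immintrin.h>
#include <stddef.h>

/* Sketch: index of the first '"' in buf, or len if none found.
   Assumes len is a multiple of 32 and buf is readable in 32-byte chunks. */
static size_t find_quote(const char *buf, size_t len)
{
    const __m256i needle = _mm256_set1_epi8('"');
    for (size_t i = 0; i < len; i += 32) {
        __m256i chunk = _mm256_loadu_si256((const __m256i *)(buf + i));
        int mask = _mm256_movemask_epi8(_mm256_cmpeq_epi8(chunk, needle));
        if (mask != 0)                                 /* test/jcc: only per-iteration branch */
            return i + __builtin_ctz((unsigned)mask);  /* bit-scan only on the hit path */
    }
    return len;
}
```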
As Paul R suggested, looking at some strlen, strchr, and memchr implementations may be informative. There are multiple hand-written asm implementations in open-source libc implementations, and other places. (e.g. glibc, and Agner Fog's asmlib.)
Many of glibc's versions scan up to an alignment boundary, then use an unrolled loop that reads 64B at a time (in 4 SSE vectors, since I don't think glibc has an AVX2 version).
To optimize for long strings, reduce the overhead of testing each compare result by ORing several compare results together and checking the combined result. If you find a hit, go back and re-test the individual vectors to see which one had the match.
It may be somewhat more efficient to do the ffs on one 64-bit integer that you built up out of multiple movemask results (with shift and |). I'm not sure about doing this inside the loop before testing for zero; I don't remember if one of glibc's strlen strategies did that or not.
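As a hedged sketch of both ideas (OR-combining inside the loop, then one 64-bit bit-scan on a hit; find_quote_unrolled is a hypothetical helper of mine, not glibc's actual code, and it assumes len is a multiple of 64):

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch: 64 bytes per iteration.  POR the two compare results so the loop
   needs only one movemask + test; on a hit, combine the two 32-bit masks
   into one 64-bit value (shift and |) and do a single bit-scan. */
static size_t find_quote_unrolled(const char *buf, size_t len)
{
    const __m256i needle = _mm256_set1_epi8('"');
    for (size_t i = 0; i < len; i += 64) {
        __m256i eq0 = _mm256_cmpeq_epi8(
            _mm256_loadu_si256((const __m256i *)(buf + i)), needle);
        __m256i eq1 = _mm256_cmpeq_epi8(
            _mm256_loadu_si256((const __m256i *)(buf + i + 32)), needle);
        __m256i any = _mm256_or_si256(eq0, eq1);
        if (_mm256_movemask_epi8(any) != 0) {
            /* Re-test: build one 64-bit mask with shift and |, one bit-scan. */
            uint64_t lo = (uint32_t)_mm256_movemask_epi8(eq0);
            uint64_t hi = (uint32_t)_mm256_movemask_epi8(eq1);
            return i + __builtin_ctzll(lo | (hi << 32));
        }
    }
    return len;
}
```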
Everything I've suggested here is stuff that can be seen in the asm of various glibc strategies for strlen, memchr, and related functions. Here's sysdeps/x86_64/strlen.S, but there may be another source file somewhere using more than baseline SSE2. (Or not; I might be thinking of a different function. Maybe there's nothing to be gained beyond SSE2 until AVX for 3-operand instructions and AVX2 for 256b integer vectors.)
See also:
- glibc's strchr-avx2.S (Woboq.org has a nice source browser with a useful search for filenames / symbols).
- glibc's memchr-avx2.S
glibc's memchr uses PMAXUB instead of POR. I'm not sure if that's useful for some arcane microarchitectural reason, but it runs on fewer ports on most CPUs. Perhaps that's desired, to avoid resource conflicts with something else? IDK, seems weird, since it competes with PCMPEQB.
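For context, the max-vs-OR swap works because each byte of a PCMPEQB result is either 0x00 or 0xFF, so an unsigned byte max of two compare results is non-zero exactly where either compare matched. A minimal illustration of my own (not glibc's code):

```c
#include <immintrin.h>

/* Sketch: combining two compare results with PMAXUB instead of POR.
   Since every byte of eq0/eq1 is 0x00 or 0xFF, the unsigned max behaves
   exactly like OR here. */
static inline int any_match(__m256i eq0, __m256i eq1)
{
    __m256i combined = _mm256_max_epu8(eq0, eq1);  /* or _mm256_or_si256(eq0, eq1) */
    return _mm256_movemask_epi8(combined) != 0;
}
```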