AVX512CD contains the intrinsic _mm512_conflict_epi32(__m512i a). It returns a vector where, for every element in a, a bit is set for each less-significant element that holds the same value. Is there a way to do something similar in AVX2?
I'm not interested in the exact bits, I just need to know which elements are duplicates of the elements to their left (or right). I simply need to know whether a scatter would conflict.
Basically I need an AVX2 equivalent for
__m256i detect_conflict(__m256i a) {
    __m256i cd = _mm256_conflict_epi32(a);
    return _mm256_cmpgt_epi32(cd, _mm256_set1_epi32(0));
}
The only way I could think of is to use _mm256_permutevar8x32_epi32() to shift each value right by 1 (across the lanes), then do seven compares, mask out the unused bits, and then _mm256_or_si256() them together, which is horribly slow.
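Roughly, I mean something like this (untested sketch; the helper name and loop structure are just for illustration):

#include <immintrin.h>

// Brute force: for k = 1..7, rotate the vector right by k elements and compare,
// masking off lanes where the rotation wrapped around, so each element is only
// compared against elements to its left.
static inline __m256i detect_conflict_bruteforce(__m256i a) {
    const __m256i iota = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    __m256i conflicts = _mm256_setzero_si256();
    for (int k = 1; k < 8; k++) {
        // element i receives a[(i-k) & 7]  (only the low 3 bits of each index matter)
        __m256i idx = _mm256_and_si256(_mm256_sub_epi32(iota, _mm256_set1_epi32(k)),
                                       _mm256_set1_epi32(7));
        __m256i rotated = _mm256_permutevar8x32_epi32(a, idx);
        __m256i eq = _mm256_cmpeq_epi32(a, rotated);
        // keep only elements i >= k, where a[i-k] really is to the left of a[i]
        __m256i valid = _mm256_cmpgt_epi32(iota, _mm256_set1_epi32(k - 1));
        conflicts = _mm256_or_si256(conflicts, _mm256_and_si256(eq, valid));
    }
    return conflicts;  // all-ones in each element that duplicates something to its left
}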
TL;DR: Since full detection of which elements conflict is expensive, it's probably worth doing more fall-back work in exchange for cheaper detection. This depends on your conflict-handling options / strategies.
I came up with a fairly efficient way to check for presence/absence of conflicts without finding their locations, like this answer for 64-bit integer elements. It's actually faster than Skylake-AVX512's micro-coded vpconflictd ymm, but of course it gives you much less information. (KNL has fast vpconflictd.)
You could use a fully-scalar fallback for all the elements if there are any conflicts. This would work well if conflicts are rare enough that branch mispredicts don't kill performance. (AVX2 doesn't have scatter instructions in the first place, though, so I'm not sure exactly what you need this for.)
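For example, the overall pattern could look something like this (hypothetical sketch: the add-1-at-each-index update is just a stand-in for whatever your scatter-style operation is, and any_conflicts32() is the check developed below):

#include <immintrin.h>
#include <stdint.h>
#include <stdbool.h>

bool any_conflicts32(__m256i v);   // cheap presence/absence check, sketched below

// Add 1 to mem[idx[i]] for 8 indices. With duplicate indices a vectorized
// read-modify-write would lose updates, so fall back to scalar in that case.
static void add1_at_indices(int32_t *mem, __m256i idx)
{
    int32_t i[8];
    _mm256_storeu_si256((__m256i*)i, idx);
    if (any_conflicts32(idx)) {
        // rare slow path: handle the 8 elements one at a time
        for (int k = 0; k < 8; k++) mem[i[k]] += 1;
    } else {
        // conflict-free path: gather, add, write back. (AVX2 has no scatter, so
        // the stores are still per-element; with AVX512 this would be a real scatter.)
        __m256i v = _mm256_add_epi32(_mm256_i32gather_epi32((const int*)mem, idx, 4),
                                     _mm256_set1_epi32(1));
        int32_t out[8];
        _mm256_storeu_si256((__m256i*)out, v);
        for (int k = 0; k < 8; k++) mem[i[k]] = out[k];
    }
}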
The only-left or only-right behaviour is hard, but my method can give you a mask of which elements have conflicts with any other element (e.g. v[0] == v[3] would result in both conflict[0] and conflict[3] being true). This costs only 1 extra shuffle, or maybe 0 with a redesign with this goal in mind.
(I misread the question at first; I thought you wanted to check both directions, rather than talking about two different implementation options for most of what vpconflictd does. Actually at first I thought you just wanted a presence/absence check, like bool any_conflicts(__m256i).)
Finding presence/absence of any conflicts: bool any_conflicts32(__m256i)
8 choose 2 is 28 total scalar comparisons. That's 3.5 vectors of packed comparisons. We should aim to do it with 4 vector compares, which leaves room for some redundancy (4 compares give 32 element comparisons for the 28 required pairs).
Creating inputs for those compares will require shuffles, and some of those will have to be lane-crossing. 4 unique comparisons require at least 4 vectors (including the initial unshuffled copy), since 3 choose 2 is only 3.
Ideally as few as possible of the shuffles are lane-crossing, and there is lots of ILP for the compares and ORing of compare results. Also nice if the shuffles don't need a vector shuffle-control, just an imm8. Also good if they're not slow on AMD Ryzen, where 256b instructions are decoded into multiple 128b uops. (Some shuffles are worse than others for this, e.g. vperm2i128 is very bad; much worse than vpermq for swapping the high and low halves of a single vector. Unfortunately clang gets this wrong even with -mtune=znver1, and compiles _mm256_permute4x64_epi64 into vperm2i128 whenever it can.)
I found a solution pretty early that achieves most of these goals: 3 shuffles, 4 compares. One of the shuffles is in-lane. All of them use an immediate control byte instead of a vector.
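In intrinsics, a version along these lines looks something like this (untested sketch; variable names are arbitrary, and the comments list which of the 28 element pairs each compare covers):

#include <immintrin.h>
#include <stdbool.h>

// 3 shuffles (two lane-crossing vpermq + one in-lane vpshufd, all imm8-controlled),
// 4 compares, 3 ORs, 1 movemask.
bool any_conflicts32(__m256i v)
{
    __m256i hilo   = _mm256_permute4x64_epi64(v, _MM_SHUFFLE(1,0,3,2)); // swap 128b halves: element i gets v[(i+4) % 8]
    __m256i rotl2  = _mm256_permute4x64_epi64(v, _MM_SHUFFLE(2,1,0,3)); // whole-vector rotate left by 2 dwords: element i gets v[(i-2) % 8]
    __m256i lrotr1 = _mm256_shuffle_epi32(v, _MM_SHUFFLE(0,3,2,1));     // in-lane rotate right by 1 dword

    __m256i eq_v_lrotr1    = _mm256_cmpeq_epi32(v, lrotr1);    // {0,1}{1,2}{2,3}{0,3} {4,5}{5,6}{6,7}{4,7}
    __m256i eq_v_hilo      = _mm256_cmpeq_epi32(v, hilo);      // {0,4}{1,5}{2,6}{3,7}, each twice (the allowed redundancy)
    __m256i eq_hilo_lrotr1 = _mm256_cmpeq_epi32(hilo, lrotr1); // {1,4}{2,5}{3,6}{0,7} {0,5}{1,6}{2,7}{3,4}
    __m256i eq_v_rotl2     = _mm256_cmpeq_epi32(v, rotl2);     // {0,2}{1,3}{2,4}{3,5}{4,6}{5,7} {0,6}{1,7}

    __m256i any = _mm256_or_si256(_mm256_or_si256(eq_v_lrotr1, eq_v_hilo),
                                  _mm256_or_si256(eq_hilo_lrotr1, eq_v_rotl2));
    return _mm256_movemask_epi8(any) != 0;   // or !_mm256_testz_si256(any, any)
}

Those are the shuffles referred to as hilo, full rotl2, and in-lane rotr1 in the design notes below.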
How I designed this:
I made a table of all the element-pairs that needed to be checked, and made columns for which shuffled operands could take care of that requirement.
I started with a few shuffles that could be done cheaply, and it turned out my early guesses worked well enough.
My design notes:
It turns out that in-lane rotr1 == full rotl2 has a lot of redundancy, so it's not worth using. It also turns out that having all the allowed redundancy in v == hilo works fine.
If you care about which result is in which element (rather than just checking for presence/absence), then v == swap_hilo(lrotr1) could work instead of lrotr1 == hilo. But we also need swap_hilo(v), so this would mean an extra shuffle.
We could instead shuffle after hilo == lrotr1, for better ILP. Or maybe there's a different set of shuffles that gives us everything. Maybe if we consider VPERMD with a vector shuffle-control...
Compiler asm output vs. optimal asm
gcc6.3 with -O3 -march=haswell compiles this to the expected sequence of 3 shuffles, 4 vpcmpeqd, 3 vpor, and a vpmovmskb.
Haswell has one shuffle unit (on port5).
So the best-case latency is 8 cycles to have a single vector ready, given resource conflicts from other instructions in this sequence but assuming no conflicts with past instructions still in the pipeline. (It should have been 7 cycles, but gcc re-ordered the dependency structure of my intrinsics, making more of the work dependent on the compare of the last shuffle result.)
This is faster than Skylake-AVX512's vpconflictd ymm, which has 17c latency, one per 10c throughput. (Of course, that gives you much more information, and @harold's emulation of it takes many more instructions.)
Fortunately gcc didn't re-order the shuffles and introduce a potential write-back conflict. (e.g. putting the vpshufd last would mean that dispatching the shuffle uops to port5 in oldest-first order would have the vpshufd ready in the same cycle as the first vpermq (1c latency vs. 3c).) gcc did this for one version of the code (where I compared the wrong variable), so it seems that gcc -mtune=haswell doesn't take this into account. (Maybe it's not a big deal; I haven't measured to see what the real effect on latency is. I know the scheduler is smart about picking uops from the Reservation Station to avoid actual write-back conflicts, but IDK how smart it is, i.e. whether it would run the vpshufd ahead of a later vpermq to avoid a write-back conflict, since it would have to look ahead to even see the upcoming write-back conflict. More likely it would just delay the vpshufd for an extra cycle before dispatching it.)
Anyway, this is why I put the _mm256_shuffle_epi32 in the middle in the C source, where it makes things easy for OOO execution.
in the middle in the C source, where it makes things easy for OOO execution.Clang 4.0 goes berserk and packs each compare result down to 128b vectors (with
vextracti128
/vpacksswb
), then expands back to 256b after threevpor xmm
before pmovmskb. I thought at first it was doing this because of-mtune=znver1
, but it does it with-mtune=haswell
as well. It does this even if we return abool
, which would let it justpmovmskb
/test
on the packed vector. /facepalm. It also pessimizes the hilo shuffle tovperm2i128
, even with-mtune=znver1
(Ryzen), wherevperm2i128
is 8 uops butvpermq
is 3. (Agner Fog's insn tables for some reasons missed those, so I took those numbers from the FP equivalentsvperm2f128
andvpermpd
@harold says that using add instead of or stops clang from packing/unpacking, but vpaddd has lower throughput than vpor on Intel pre-Skylake.
on Intel pre-Skylake.Even better for Ryzen, the
v == hilo
compare can do only the low half. (i.e. usevpcmpeqd xmm2, xmm2, xmm3
, which is only 1 uop instead of 2). We still need the fullhilo
forhilo == lrot1
, though. So we can't just usevextracti128 xmm2, xmm0, 1
instead of thevpermq
shuffle.vextracti128
has excellent performance on Ryzen: 1 uop, 1c latency, 0.33c throughput (can run on any of P0/1/3).Since we're ORing everything together, it's fine to have zeros instead of redundant compare results in the high half.
As I noted in comments, IDK how to safely write this with intrinsics. The obvious way would be to use _mm256_castsi128_si256(_mm_cmpeq_epi32(v, hilo)), but that technically leaves the high lane undefined, rather than zero. There's no sane way a compiler would do anything other than use the full-width ymm register that contains the xmm register with the 128b compare result, but it would be legal according to Intel's docs for a Deathstation-9000 compiler to put garbage there. Any explicit way of getting zeros in the high half would depend on the compiler optimizing it away. Maybe _mm256_setr_m128i(cmpresult, _mm_setzero_si128()).
There are no current CPUs with AVX512F but not AVX512CD. But if that combo is interesting or relevant, clang makes some interesting asm from my code with -mavx512f -mavx512vl. It uses EVEX vpcmpeqd into mask registers, and korw to merge them. But then it expands that back into a vector to set up for vpmovmskb, instead of just optimizing away the movemask and using the korw result. /facepalm.
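For what it's worth, writing the check directly with AVX512VL mask-compare intrinsics would look something like this (sketch, same shuffles as the AVX2 version; ideally the final test compiles to a kortest instead of a round-trip through a vector):

#include <immintrin.h>
#include <stdbool.h>

bool any_conflicts32_avx512vl(__m256i v)
{
    __m256i hilo   = _mm256_permute4x64_epi64(v, _MM_SHUFFLE(1,0,3,2));
    __m256i rotl2  = _mm256_permute4x64_epi64(v, _MM_SHUFFLE(2,1,0,3));
    __m256i lrotr1 = _mm256_shuffle_epi32(v, _MM_SHUFFLE(0,3,2,1));

    __mmask8 m = _mm256_cmpeq_epi32_mask(v, lrotr1)      // EVEX vpcmpeqd into mask regs
               | _mm256_cmpeq_epi32_mask(v, hilo)
               | _mm256_cmpeq_epi32_mask(hilo, lrotr1)
               | _mm256_cmpeq_epi32_mask(v, rotl2);
    return m != 0;
}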