I want to speed up the following operation with AVX2 instructions, but I was not able to find a way to do so.
I am given a large array `uint64_t data[100000]` of uint64_t's, and an array `unsigned char Indices[100000]` of bytes. I want to output an array `uint64_t Out[256]` where the i-th value is the XOR of all `data[j]` such that `Indices[j] == i`.
A straightforward implementation of what I want is this:
uint64_t Out[256] = {0};   // initialize output array
for (size_t i = 0; i < 100000; i++) {
    Out[Indices[i]] ^= data[i];
}
Can we implement this more efficiently with AVX2 instructions?
EDIT: This is what my code looks like now:
uint64_t Out[256][4] = {{0}};   // initialize output array
for (size_t i = 0; i < 100000; i += 4) {
    Out[Indices[i  ]][0] ^= data[i];
    Out[Indices[i+1]][1] ^= data[i+1];
    Out[Indices[i+2]][2] ^= data[i+2];
    Out[Indices[i+3]][3] ^= data[i+3];
}
Based on static analysis for Haswell/Skylake, I came up with a version that runs in ~5 cycles per 4 `i` values, instead of 8 cycles, when compiled by gcc. That's an average for large sizes, not including the time to combine multiple copies of `Out[]`, and assuming a random distribution of Indices that doesn't lead to any store/reload dep chains running for long enough to matter. IDK if you care about Ryzen or Excavator (the other 2 mainstream AVX2 microarchitectures).
I haven't done a careful analysis by hand, but IACA is wrong for HSW/SKL and thinks some instructions don't micro-fuse when in fact they do (tested on an i7-6700k with perf counters), so it thinks the front-end bottleneck is more severe than it really is. e.g. a `movhps` load+merge micro-fuses, but IACA thinks it doesn't, even with simple addressing modes.

We should have negligible cache misses, because `uint64_t Out[4][256]` is only 8 kiB. So our cache footprint is only 1/4 of L1d size on most recent CPUs, and should be mostly fine even with hyperthreading sharing L1d between two logical threads. Looping over `data[]` and `Indices[]` should prefetch well, and hopefully doesn't evict `Out[]` much.

Thus, static analysis has a good chance of being somewhat accurate; it's quicker than careful micro-benchmarking and, more importantly, it tells you exactly what the bottlenecks are. But of course we're relying heavily on out-of-order execution, and imperfect scheduling or other unexpected bottlenecks could easily happen. I didn't feel like actually microbenchmarking if I'm not getting paid, though.
This is basically a histogram problem. The usual histogram optimization of using multiple tables and combining at the end applies. SIMD XOR is useful for the combine-at-the-end (as long as you use `Out[4][256]`, not `Out[256][4]`). The latter also makes the indexing slower by requiring scaling by `8*4` instead of `8` (which can be done with a single LEA in a scaled-index addressing mode).

But unlike a normal histogram, you're XORing in some data from memory instead of ADDing a constant 1. So instead of an immediate `1`, the code has to load `data[i]` into a register as a source for `xor`. (Or load, then `xor reg, data[i]` / store.) This is even more total memory operations than a histogram.
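For concreteness, here is a minimal scalar sketch of the multi-slice layout (the function name, the `n` parameter, and folding the combine step into the same function are my own choices for illustration):

```c
#include <stdint.h>
#include <stddef.h>

void xor_hist_scalar(uint64_t Out[4][256],
                     const uint64_t *data, const unsigned char *Indices,
                     size_t n)                  // assume n is a multiple of 4
{
    for (size_t i = 0; i < n; i += 4) {
        // Each of the 4 elements per group goes into its own slice,
        // so the stores are (usually) independent.
        Out[0][Indices[i + 0]] ^= data[i + 0];
        Out[1][Indices[i + 1]] ^= data[i + 1];
        Out[2][Indices[i + 2]] ^= data[i + 2];
        Out[3][Indices[i + 3]] ^= data[i + 3];
    }
    // Combine the slices at the end; SIMD XOR works well for this
    // (see the cleanup loop later in the answer).
    for (int j = 0; j < 256; j++)
        Out[0][j] ^= Out[1][j] ^ Out[2][j] ^ Out[3][j];
}
```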
We come out ahead from "manual" gather/scatter into SIMD vectors (using `movq` / `movhps` loads/stores), allowing us to use SIMD for the `data[i]` load and XOR. This reduces the total number of load operations, and thus reduces load-port pressure without costing extra front-end bandwidth.

Manual gather into 256-bit vectors is probably not worth the extra shuffling (an extra `vinserti128` / `vextracti128` just so we can combine 2 memory-source `vpxor`s into one 256-bit one). 128-bit vectors should be good.

Front-end throughput is also a major issue, because (on Intel SnB-family CPUs) you want to avoid indexed addressing modes for the stores. gcc uses `lea` instructions to calculate addresses in registers, instead of using indexed loads/stores. clang / LLVM with `-march=skylake` decides not to, which is a bad decision in this case because the loop bottlenecks on port 2 / port 3, and spending extra ALU uops to allow store-address uops to use port 7 is a win. But if you're not bottlenecked on p23, spending extra uops to avoid indexed stores is not good. (And in cases where they can stay micro-fused, definitely not just to avoid indexed loads; silly gcc.) Maybe gcc and LLVM's addressing-mode cost models aren't very accurate, or they don't model the pipeline in enough detail to figure out when a loop bottlenecks on the front-end vs. a specific port.

Choice of addressing modes and other asm code-gen choices are critical for this to perform optimally on SnB-family. But writing in C gives you no control over that; you're mostly at the mercy of the compiler, unless you can tweak the source to get it to make a different choice. e.g. gcc vs. clang makes a significant difference here.
On SnB-family, a `movhps` load needs port 5 for the shuffle/blend (although it does micro-fuse into one uop), but a `movhps` store is a pure store with no ALU uop. So it's break-even there, and lets us use one SIMD load / XOR for two data elements.

With AVX, unaligned memory source operands are allowed for ALU uops, so we don't need to require alignment for `data[]`. But Intel HSW/SKL can keep an indexed addressing mode micro-fused with `pxor` but not `vpxor`. So compiling without AVX enabled can be better, allowing the compiler to use an indexed addressing mode instead of incrementing a separate pointer. (Or making it faster if the compiler doesn't know this and uses an indexed addressing mode anyway.) TL:DR: probably best to require 16-byte aligned `data[]` and compile that function with AVX disabled, for better micro-fusion. (But then we miss out on 256-bit SIMD for combining the `Out` slices at the end, unless we put that in a different function compiled with AVX or AVX2.)

Avoiding unaligned loads will also avoid any cache-line splits, which don't cost extra uops but do matter because we're probably close to bottlenecking on L1d throughput limits, not just load/store execution-unit throughput limits.
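To make that data movement concrete, here is a rough intrinsics sketch of the inner loop. It's my own illustration, not the asm analyzed below: the function name, treating `n` as a multiple of 4, and assuming 16-byte-aligned `data[]` are all assumptions, and the interesting parts discussed above (which addressing modes and `lea`s the compiler picks) can't be controlled from this level.

```c
#include <stdint.h>
#include <stddef.h>
#include <emmintrin.h>   // SSE2: movq / movhps style loads and stores

static void xor_hist_simd(uint64_t Out[4][256],
                          const uint64_t *data,          // assumed 16-byte aligned
                          const unsigned char *Indices,
                          size_t n)                      // assumed multiple of 4
{
    for (size_t i = 0; i < n; i += 4) {
        for (int k = 0; k < 4; k += 2) {
            uint64_t *lo = &Out[k][Indices[i + k]];
            uint64_t *hi = &Out[k + 1][Indices[i + k + 1]];
            // "Gather" two Out[] entries from two different slices: movq + movhps
            __m128i acc = _mm_loadl_epi64((const __m128i *)lo);
            acc = _mm_castpd_si128(_mm_loadh_pd(_mm_castsi128_pd(acc),
                                                (const double *)hi));
            // One aligned 16-byte load covers data[i+k] and data[i+k+1]
            acc = _mm_xor_si128(acc, _mm_load_si128((const __m128i *)&data[i + k]));
            // "Scatter" the two updated entries back: movq + movhps stores
            _mm_storel_epi64((__m128i *)lo, acc);
            _mm_storeh_pd((double *)hi, _mm_castsi128_pd(acc));
        }
    }
}
```

Because the two lanes of each vector always target different slices of `Out`, the paired store never has to worry about both elements hitting the same table entry.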
I also looked at loading 4 indices at once and unpacking with ALU instructions, e.g. with `memcpy` into `struct { uint8_t idx[4]; } idx;`. But gcc generates multiple wasted instructions for unpacking that. Too bad x86 doesn't have great bitfield instructions like ARM `ubfx` or especially PowerPC `rlwinm` (which could leave the result left-shifted for free, so if x86 had that, a static `Out` could have used a base+disp32 addressing mode in non-PIC code).

Unpacking a dword with shift / movzx from AL/AH is a win if we're using scalar XOR, but it looks like it's not when we're using SIMD for `data[]` and spending front-end throughput on `lea` instructions to allow store-address uops to run on port 7. That makes us front-end bottlenecked instead of port 2 / port 3 bottlenecked, so using 4x `movzx` loads from memory looks best according to static analysis. It would be worth benchmarking both ways if you take the time to hand-edit the asm. (The gcc-generated asm with extra uops is just bad, including a completely redundant `movzx` after right-shifting by 24, which already leaves the upper bits zero.)
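For reference, the 4-at-a-time index load described above looks roughly like this in C (the helper name is mine; ideally this compiles to one dword load plus shifts / `movzx`, with no extra zero-extension after the final shift):

```c
#include <stdint.h>
#include <string.h>

// One 4-byte load instead of 4 separate byte loads, then unpack in registers.
static inline void load4_indices(const unsigned char *Indices, size_t i,
                                 unsigned idx[4])
{
    uint32_t dword;
    memcpy(&dword, &Indices[i], sizeof(dword));
    idx[0] = dword & 0xff;
    idx[1] = (dword >> 8) & 0xff;
    idx[2] = (dword >> 16) & 0xff;
    idx[3] = dword >> 24;     // upper bits already zero; no extra movzx needed
}
```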
The code

(See it on the Godbolt compiler explorer, along with a scalar version.)

Compiled with gcc7.3 `-std=gnu11 -DIACA_MARKS -O3 -march=skylake -mno-avx`, and analyzed with IACA-3.0.

gcc8.1 on Godbolt uses a scaled-index addressing mode for `pxor`, using the same counter for `Indices` and `data[]`, so that saves an `add`.

clang doesn't use LEA, and bottlenecks at 4 `i` values per 7 cycles, because none of the store uops can run on port 7.
The scalar version (still using 4 slices of `Out[4][256]`):

The loop is 4 fused-domain uops shorter than what IACA counts, because IACA doesn't know that only SnB/IvB un-laminate indexed stores; HSW/SKL don't. Such stores still can't use port 7, though, so this won't get any better than ~6.5 cycles for 4 elements.
(And BTW, with naive handling of Indices[i], loading each one separately with movzx, you get 8 cycles for 4 elements, saturating ports 2 and 3. Even though gcc doesn't generate throughput-optimal code for unpacking the struct, the 4-byte load + unpack should be a net win by relieving some load-port pressure.)
Cleanup loop:
AVX2 really shines here: we loop over the lowest slice of the histogram, and XOR in the other slices. This loop is 8 front-end uops with 4 loads on Skylake, and should run at 1 iter per 2 clocks:
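The asm that was analyzed is in the Godbolt link; a rough intrinsics equivalent of that cleanup loop (my sketch, using unaligned loads/stores in case `Out` isn't 32-byte aligned) looks like:

```c
#include <stdint.h>
#include <immintrin.h>   // AVX2

// Combine the 4 slices into Out[0][...], 4 uint64_t per __m256i.
static void combine_slices(uint64_t Out[4][256])
{
    for (int j = 0; j < 256; j += 4) {
        __m256i v = _mm256_loadu_si256((const __m256i *)&Out[0][j]);
        v = _mm256_xor_si256(v, _mm256_loadu_si256((const __m256i *)&Out[1][j]));
        v = _mm256_xor_si256(v, _mm256_loadu_si256((const __m256i *)&Out[2][j]));
        v = _mm256_xor_si256(v, _mm256_loadu_si256((const __m256i *)&Out[3][j]));
        _mm256_storeu_si256((__m256i *)&Out[0][j], v);
    }
}
```

The uop counts quoted above are for the gcc output on Godbolt; a compiler may emit something slightly different for this sketch.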
I tried to reduce the uop count further by doing the XORs in one chain, but gcc insists on doing two `vmovdqa` loads and having to do one `vpxor` without a memory operand. (OoO exec will hide the latency of this tiny chain / tree of VPXOR, so it doesn't matter.)

What about using hardware gather/scatter instructions instead of the manual movq/movhps version? You'd use a gather to get the old values, then SIMD XOR, then scatter the updated elements back to the locations they came from.
To avoid conflicts, you might want `Out[8][256]` so every vector element can use a different table. (Otherwise you have a problem if `Indices[i+0]` and `Indices[i+4]` were equal, because the scatter store would just store the highest vector element with that index.)
were equal, because the scatter store would just store the highest vector element with that index.Scatter/gather instructions need a single base register, but you can simply add
_mm256_setr_epi64(0, 256, 256*2, ...);
after doing avpmovzxbq
zero-extending load.Notes
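For illustration, a sketch of that gather/XOR/scatter variant. It assumes AVX-512F (AVX2 has gathers but no scatter) and processes 8 elements per iteration, with `Out[8][256]` and per-lane offsets so no two lanes of one scatter can alias the same address:

```c
#include <stdint.h>
#include <stddef.h>
#include <immintrin.h>   // AVX-512F: scatter has no AVX2 equivalent

static void xor_hist_gather_scatter(uint64_t Out[8][256],
                                    const uint64_t *data,
                                    const unsigned char *Indices,
                                    size_t n)          // assumed multiple of 8
{
    // Lane k indexes into slice Out[k][], i.e. an extra k*256 elements.
    const __m512i lane_base = _mm512_setr_epi64(0, 256, 256*2, 256*3,
                                                256*4, 256*5, 256*6, 256*7);
    for (size_t i = 0; i < n; i += 8) {
        // vpmovzxbq: zero-extend 8 index bytes to 8 qword element indices
        __m512i idx = _mm512_cvtepu8_epi64(
                          _mm_loadl_epi64((const __m128i *)&Indices[i]));
        idx = _mm512_add_epi64(idx, lane_base);
        __m512i old = _mm512_i64gather_epi64(idx, &Out[0][0], 8);  // old values
        __m512i d   = _mm512_loadu_si512(&data[i]);
        _mm512_i64scatter_epi64(&Out[0][0], idx,
                                _mm512_xor_si512(old, d), 8);      // write back
    }
}
```

Whether this actually beats the manual movq/movhps version would need measuring; gather and scatter instructions are multiple uops and still compete for the same load/store ports.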
Notes

I used IACA 2.3 for the scalar analysis because IACA 3.0 seems to have removed the `-mark` option to choose which loop to analyze when you have multiple marks in one file. IACA 3.0 didn't fix any of the ways that IACA 2.3 is wrong about SKL's pipeline in this case.

An alternative approach: you can sort the data according to `Indices[i]`. This should take O(N*log2(N)) work, but it can be parallelized.
Then take the cumulative XOR of the sorted data (a prefix-XOR scan), which can also be parallelized.

Then it's a matter of calculating `Out[i] = CumXor(j_i) ^ CumXor(j_{i-1})`, where `j_i` is the last sorted position whose index equals `i` (and the second term is 0 if no smaller index occurs).
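A minimal serial sketch of this approach (names are mine; the sort and the prefix-XOR scan are the parts you'd parallelize):

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define N 100000

struct pair { unsigned char idx; uint64_t val; };

static int cmp_idx(const void *a, const void *b)
{
    const struct pair *pa = a, *pb = b;
    return (int)pa->idx - (int)pb->idx;
}

void xor_hist_sorted(uint64_t Out[256],
                     const uint64_t *data, const unsigned char *Indices)
{
    static struct pair p[N];            // static scratch to keep the sketch short
    for (size_t i = 0; i < N; i++) {
        p[i].idx = Indices[i];
        p[i].val = data[i];
    }
    qsort(p, N, sizeof(p[0]), cmp_idx); // O(N log N); parallelizable

    memset(Out, 0, 256 * sizeof(uint64_t));
    uint64_t cum = 0, prev_cum = 0;     // running prefix XOR / value at end of previous group
    for (size_t i = 0; i < N; i++) {
        cum ^= p[i].val;                // prefix XOR of the sorted data
        if (i + 1 == N || p[i + 1].idx != p[i].idx) {
            Out[p[i].idx] = cum ^ prev_cum;   // XOR of this index's segment
            prev_cum = cum;
        }
    }
}
```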