I want to vectorize the multiplication of two memory aligned arrays. I didn't find any way to multiply 64*64 bit in AVX/AVX2, so I just did loop-unroll and AVX2 loads/stores. Is there a faster way to do this?
Note: I don't want to save the high-half result of each multiplication.
void multiply_vex(long *Gi_vec, long q, long *Gj_vec){
int i;
__m256i data_j, data_i;
__uint64_t *ptr_J = (__uint64_t*)&data_j;
__uint64_t *ptr_I = (__uint64_t*)&data_i;
for (i=0; i<BASE_VEX_STOP; i+=4) {
data_i = _mm256_load_si256((__m256i*)&Gi_vec[i]);
data_j = _mm256_load_si256((__m256i*)&Gj_vec[i]);
ptr_I[0] -= ptr_J[0] * q;
ptr_I[1] -= ptr_J[1] * q;
ptr_I[2] -= ptr_J[2] * q;
ptr_I[3] -= ptr_J[3] * q;
_mm256_store_si256((__m256i*)&Gi_vec[i], data_i);
}
for (; i<BASE_DIMENSION; i++)
Gi_vec[i] -= Gj_vec[i] * q;
}
UPDATE:
I'm using the Haswell microarchitecture with both ICC/GCC compilers. So both AVX and AVX2 is fine.
I substitute the -=
by the C intrisic _mm256_sub_epi64
after the multiplication loop-unroll, where it get some speedup. Currently, it is ptr_J[0] *= q; ...
I use __uint64_t
but is a error. The right data type is __int64_t
.
If you're interested in SIMD 64bx64b to 64b (lower) operations here are the AVX and AVX2 solutions from Agner Fog's Vector Class Library. I would test these with arrays and see how it compares to what GCC does with a generic loop such as the one in Peter Cordes' answer.
AVX (use SSE - you can still compile with
-mavx
to get vex encoding).AVX2
These functions work for signed and unsigned 64-bit integers. In your case since
q
is constant within the loop you don't need to recalculate some things every iteration but your compiler will probably figure that out anyway.You seem to be assuming
long
is 64bits in your code, but then using__uint64_t
as well. In 32bit, the x32 ABI, and on Windows,long
is a 32bit type. Your title mentionslong long
, but then your code ignores it. I was wondering for a while if your code was assuming thatlong
was 32bit.You're completely shooting yourself in the foot by using AVX256 loads but then aliasing a pointer onto the
__m256i
to do scalar operations. gcc just gives up and gives you the terrible code you asked for: vector load and then a bunch ofextract
andinsert
instructions. Your way of writing it means that both vectors have to be unpacked to do thesub
in scalar as well, instead of usingvpsubq
.Modern x86 CPUs have very fast L1 cache that can handle two operations per clock. (Haswell and later: two loads and one store per clock). Doing multiple scalar loads from the same cache line is better than a vector load and unpacking. (Imperfect uop scheduling reduces the throughput to about 84% of that, though: see below)
gcc 5.3 -O3 -march=haswell (Godbolt compiler explorer) auto-vectorizes a simple scalar implementation pretty well. When AVX2 isn't available, gcc foolishly still auto-vectorizes with 128b vectors: On Haswell, this will actually be about 1/2 the speed of ideal scalar 64bit code. (See the perf analysis below, but substitute 2 elements per vector instead of 4).
inner loop:
Translate that back to intrinsics if you want, but it's going to be a lot easier to just let the compiler autovectorize. I didn't try to analyse it to see if it's optimal.
If you don't usually compile with
-O3
, you could use#pragma omp simd
before the loop (and-fopenmp
).Of course, instead of a scalar epilogue, it would prob. be faster to do an unaligned load of the last 32B of Gj_vec, and store into the last 32B of Gi_vec, potentially overlapping with the last store from the loop. (A scalar fallback is still needed if the arrays are smaller than 32B.)
Improved vector intrinsic version for Haswell
From my comments on Z Boson's answer. Based on Agner Fog's vector class library code.
Agner Fog's version saves an instruction but bottlenecks on the shuffle port by using phadd + pshufd where I use psrlq / paddq / pand.
Since one of your operands is constant, make sure to pass
set1(q)
asb
, nota
, so the "bswap" shuffle can be hoisted.See it on Godbolt.
Note that this doesn't include the final subtract, only the multiply.
This version should perform a bit better on Haswell than gcc's autovectorized version. (like maybe one vector per 4 cycles instead of one vector per 5 cycles, bottlenecked on port0 throughput. I didn't consider other bottlenecks for the full problem, since this was a late addition to the answer.)
An AVX1 version (two elements per vector) would suck, and probably still be worse than 64bit scalar. Don't do it unless you already have your data in vectors, and want the result in a vector (extracting to scalar and back might not be worth it).
Perf analysis of GCC's autovectorized code (not the intrinsic version)
Background: see Agner Fog's insn tables and microarch guide, and other links in the x86 tag wiki.
Until AVX512 (see below), this is probably only barely faster than scalar 64bit code:
imul r64, m64
has a throughput of one per clock on Intel CPUs (but one per 4 clocks on AMD Bulldozer-family). load/imul/sub-with-memory-dest is 4 fused-domain uops on Intel CPUs (with an addressing mode that can micro-fuse, which gcc fails to use). The pipeline width is 4 fused-domain uops per clock, so even a large unroll can't get this to issue at one-per-clock. With enough unrolling, we'll bottleneck on load/store throughput. 2 loads and one store per clock is possible on Haswell, but stores-address uops stealing load ports will lower the throughput to about 81/96 = 84% of that, according to Intel's manual.So perhaps the best way for Haswell would load and multiply with scalar, (2 uops), then
vmovq
/pinsrq
/vinserti128
so you can do the subtract with avpsubq
. That's 8 uops to load&multiply all 4 scalars, 7 shuffle uops to get the data into a __m256i (2 (movq) + 4 (pinsrq is 2 uops) + 1 vinserti128), and 3 more uops to do a vector load / vpsubq / vector store. So that's 18 fused-domain uops per 4 multiplies (4.5 cycles to issue), but 7 shuffle uops (7 cycles to execute). So nvm, this is no good compared to pure scalar.The autovectorized code is using 8 vector ALU instructions for each vector of four values. On Haswell, 5 of those uops (multiplies and shifts) can only run on port 0, so no matter how you unroll this algorithm it will achieve at best one vector per 5 cycles (i.e. one multiply per 5/4 cycles.)
The shifts could be replaced with
pshufb
(port 5) to move the data and shift in zeros. (Other shuffles don't support zeroing instead of copying a byte from the input, and there aren't any known-zeros in the input that we could copy.)paddq
/psubq
can run on ports 1/5 on Haswell, or p015 on Skylake.Skylake runs
pmuludq
and immediate-count vector shifts on on p01, so it could in theory manage a throughput of one vector per max(5/2, 8/3, 11/4) = 11/4 = 2.75 cycles. So it bottlenecks on total fused-domain uop throughput (including the 2 vector loads and 1 vector store). So a bit of loop unrolling will help. Probably resource conflicts from imperfect scheduling will bottleneck it to slightly less than 4 fused-domain uops per clock. The loop overhead can hopefully run on port 6, which can only handle some scalar ops, includingadd
and compare-and-branch, leaving ports 0/1/5 for vector ALU ops, since they're close to saturating (8/3 = 2.666 clocks). The load/store ports are nowhere near saturating, though.So, Skylake can theoretically manage one vector per 2.75 cycles (plus loop overhead), or one multiply per ~0.7 cycles, vs. Haswell's best option (one per ~1.2 cycles in theory with scalar, or one per 1.25 cycles in theory with vectors). The scalar one per ~1.2 cycles would probably require a hand-tuned asm loop, though, because compilers don't know how to use a one-register addressing mode for stores, and a two-register addressing mode for loads (
dst + (src-dst)
and incrementdst
).Also, if your data isn't hot in L1 cache, getting the job done with fewer instructions lets the frontend get ahead of the execution units, and get started on the loads before the data is needed. Hardware prefetch doesn't cross page lines, so a vector loop will probably beat scalar in practice for large arrays, and maybe even for smaller arrays.
AVX-512DQ introduces a 64bx64b->64b vector multiply
gcc can auto-vectorize using it, if you add
-mavx512dq
.So AVX512DQ (expected to be part of Skylake multi-socket Xeon (Purley) in ~2017) will give a much larger than 2x speedup (from wider vectors) if these instructions are pipelined at one per clock.
Update: Skylake-AVX512 (aka SKL-X or SKL-SP) runs VPMULLQ at one per 1.5 cycles for xmm, ymm, or zmm vectors. It's 3 uops with 15c latency. (With maybe an extra 1c of latency for the zmm version, if that's not a measurement glitch in the AIDA results.)
vpmullq
is much faster than anything you can build out of 32-bit chunks, so it's very much worth having an instruction for this even if current CPUs don't have 64-bit-element vector-multiply hardware. (Presumably they use the mantissa multipliers in the FMA units.)