The Intel Advanced Vector Extensions (AVX) offer no dot product in the 256-bit version (YMM registers) for double-precision floating-point variables. The "Why?" question has been treated very briefly in another forum (here) and on Stack Overflow (here). But the question I am facing is how to replace this missing instruction with other AVX instructions in an efficient way.
The dot product in 256-bit version exists for single precision floating point variables (reference here):
__m256 _mm256_dp_ps(__m256 m1, __m256 m2, const int mask);
The idea is to find an efficient equivalent for this missing instruction:
__m256d _mm256_dp_pd(__m256d m1, __m256d m2, const int mask);
To be more specific, the code I would like to transform from __m128 (four floats) to __m256d (four doubles) uses the following instructions:
__m128 val0 = ...; // Four float values
__m128 val1 = ...; //
__m128 val2 = ...; //
__m128 val3 = ...; //
__m128 val4 = ...; //
__m128 res = _mm_or_ps( _mm_dp_ps(val1, val0, 0xF1),
             _mm_or_ps( _mm_dp_ps(val2, val0, 0xF2),
             _mm_or_ps( _mm_dp_ps(val3, val0, 0xF4),
                        _mm_dp_ps(val4, val0, 0xF8) )));
The result of this code is a __m128 vector of four floats containing the results of the dot products between val1 and val0, val2 and val0, val3 and val0, and val4 and val0.
Maybe this can give hints for the suggestions?
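For reference, the result described above can be written as plain scalar C for the double-precision target. This is only a sketch of the intended semantics: the function name dot4_ref and the pointer-based interface are hypothetical and not part of the question.

```c
/* Hypothetical scalar reference (not from the question):
 * res[i] = dot(val_{i+1}, val0) for i = 0..3, all vectors holding 4 doubles. */
void dot4_ref(const double *val0, const double *val1, const double *val2,
              const double *val3, const double *val4, double *res)
{
    const double *rows[4] = { val1, val2, val3, val4 };
    for (int i = 0; i < 4; ++i) {
        double sum = 0.0;
        for (int k = 0; k < 4; ++k)
            sum += rows[i][k] * val0[k];   /* element-wise multiply, then sum */
        res[i] = sum;
    }
}
```

Any vectorized replacement should produce the same four values, up to floating-point summation order.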
I would use a 4*double multiplication, then a hadd (which unfortunately adds only 2*2 doubles, in the upper and lower halves), extract the upper half (a shuffle should work equally well, maybe faster) and add it to the lower half. The result is in the low 64 bits of dotproduct.
Edit: Following an idea from Norbert P., I extended this version to do 4 dot products at one time.
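Both variants could look roughly like the sketch below. It assumes unaligned arrays of 4 doubles as inputs; the function names dot_hadd and dot4_hadd, the pointer-based interface and the AVX_FN macro are illustrative assumptions rather than the exact code of this answer.

```c
#include <immintrin.h>

/* Assumption: GCC/Clang target attribute so the sketch builds without -mavx.
 * With other compilers, enable AVX through the compiler options instead. */
#if defined(__GNUC__) || defined(__clang__)
#define AVX_FN __attribute__((target("avx")))
#else
#define AVX_FN
#endif

/* One dot product: multiply, hadd, then add the upper half to the lower half. */
AVX_FN double dot_hadd(const double *x, const double *y)
{
    __m256d xy    = _mm256_mul_pd(_mm256_loadu_pd(x), _mm256_loadu_pd(y));
    __m256d temp  = _mm256_hadd_pd(xy, xy);            /* {xy0+xy1, xy0+xy1, xy2+xy3, xy2+xy3} */
    __m128d hi128 = _mm256_extractf128_pd(temp, 1);    /* upper 128 bits */
    __m128d dotproduct = _mm_add_pd(_mm256_castpd256_pd128(temp), hi128);
    return _mm_cvtsd_f64(dotproduct);                  /* result is in the low 64 bits */
}

/* Four dot products against v0, packed into out[0..3]. */
AVX_FN void dot4_hadd(const double *v0, const double *v1, const double *v2,
                      const double *v3, const double *v4, double *out)
{
    __m256d a  = _mm256_loadu_pd(v0);
    __m256d p1 = _mm256_mul_pd(_mm256_loadu_pd(v1), a);
    __m256d p2 = _mm256_mul_pd(_mm256_loadu_pd(v2), a);
    __m256d p3 = _mm256_mul_pd(_mm256_loadu_pd(v3), a);
    __m256d p4 = _mm256_mul_pd(_mm256_loadu_pd(v4), a);

    /* low to high: p1[0]+p1[1], p2[0]+p2[1], p1[2]+p1[3], p2[2]+p2[3] */
    __m256d t12 = _mm256_hadd_pd(p1, p2);
    /* low to high: p3[0]+p3[1], p4[0]+p4[1], p3[2]+p3[3], p4[2]+p4[3] */
    __m256d t34 = _mm256_hadd_pd(p3, p4);

    /* high lane of t12 followed by low lane of t34 */
    __m256d swapped = _mm256_permute2f128_pd(t12, t34, 0x21);
    /* low lane of t12 followed by high lane of t34 */
    __m256d blended = _mm256_blend_pd(t12, t34, 0xC);

    _mm256_storeu_pd(out, _mm256_add_pd(swapped, blended));
}
```

The blend and the lane-crossing permute arrange the partial sums so that one final vertical add completes all four dot products at once.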
I would extend drhirsch's answer to perform two dot products at the same time, saving some work:
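A minimal sketch of this two-at-a-time variant, assuming unaligned arrays of 4 doubles; the function name dot2_hadd, the pointer-based interface and the AVX_FN macro are illustrative assumptions:

```c
#include <immintrin.h>

/* Assumption: GCC/Clang target attribute so the sketch builds without -mavx.
 * With other compilers, enable AVX through the compiler options instead. */
#if defined(__GNUC__) || defined(__clang__)
#define AVX_FN __attribute__((target("avx")))
#else
#define AVX_FN
#endif

/* Computes out[0] = dot(x, y) and out[1] = dot(z, w) with a single hadd. */
AVX_FN void dot2_hadd(const double *x, const double *y,
                      const double *z, const double *w, double *out)
{
    __m256d xy = _mm256_mul_pd(_mm256_loadu_pd(x), _mm256_loadu_pd(y));
    __m256d zw = _mm256_mul_pd(_mm256_loadu_pd(z), _mm256_loadu_pd(w));

    /* low to high: xy0+xy1, zw0+zw1, xy2+xy3, zw2+zw3 */
    __m256d temp  = _mm256_hadd_pd(xy, zw);
    __m128d hi128 = _mm256_extractf128_pd(temp, 1);    /* {xy2+xy3, zw2+zw3} */
    __m128d dotproduct = _mm_add_pd(_mm256_castpd256_pd128(temp), hi128);

    _mm_storeu_pd(out, dotproduct);
}
```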
Then dot(x,y) is in the low double and dot(z,w) is in the high double of dotproduct.
For a single dot product, it's simply a vertical multiply and horizontal sum (see Fastest way to do horizontal float vector sum on x86). hadd costs 2 shuffles + an add. It's almost always sub-optimal for throughput when used with both inputs = the same vector.
If you only need one dot product, a horizontal sum that extracts the high 128b half first is better than @hirschhornsalz's single-vector answer by 1 shuffle uop on Intel, and a bigger win on AMD Jaguar / Bulldozer-family / Ryzen because it narrows down to 128b right away instead of doing a bunch of 256b stuff. AMD splits 256b ops into two 128b uops.
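The narrow-to-128b-first approach for a single dot product might be sketched as follows. The function name dot_narrow_first, the pointer-based interface and the AVX_FN macro are assumptions made for illustration.

```c
#include <immintrin.h>

/* Assumption: GCC/Clang target attribute so the sketch builds without -mavx.
 * With other compilers, enable AVX through the compiler options instead. */
#if defined(__GNUC__) || defined(__clang__)
#define AVX_FN __attribute__((target("avx")))
#else
#define AVX_FN
#endif

/* Vertical multiply, then a horizontal sum that drops to 128 bits immediately. */
AVX_FN double dot_narrow_first(const double *x, const double *y)
{
    __m256d xy = _mm256_mul_pd(_mm256_loadu_pd(x), _mm256_loadu_pd(y));

    __m128d vlow  = _mm256_castpd256_pd128(xy);        /* low 128 bits, no instruction needed */
    __m128d vhigh = _mm256_extractf128_pd(xy, 1);      /* high 128 bits */
    vlow = _mm_add_pd(vlow, vhigh);                    /* {xy0+xy2, xy1+xy3} */

    __m128d high64 = _mm_unpackhi_pd(vlow, vlow);      /* broadcast the upper element */
    return _mm_cvtsd_f64(_mm_add_sd(vlow, high64));    /* reduce to a scalar */
}
```

After the multiply, only 128-bit operations remain, which is where the advantage on CPUs that split 256b ops comes from.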
It can be worth using hadd in cases like doing 2 or 4 dot products in parallel, where you're using it with 2 different input vectors. Norbert's dot of two pairs of vectors looks optimal if you want the results packed. I don't see any way to do better even with AVX2 vpermpd as a lane-crossing shuffle.
Of course, if you really want one larger dot (of 8 or more doubles), use vertical add (with multiple accumulators to hide vaddpd latency) and do the horizontal summing at the end. You can also use fma if available.
haddpd internally shuffles xy and zw together two different ways and feeds that to a vertical addpd, and that's what we'd do by hand anyway. If we kept xy and zw separate, we'd need 2 shuffles + 2 adds for each one to get a dot product (in separate registers). So by shuffling them together with hadd as a first step, we save on the total number of shuffles, and also on adds and total uop count.
But for AMD, where vextractf128 is very cheap, and 256b hadd costs 2x as much as 128b hadd, it could make sense to narrow each 256b product down to 128b separately and then combine with a 128b hadd.
Actually, according to Agner Fog's tables, haddpd xmm,xmm is 4 uops on Ryzen. (And the 256b ymm version is 8 uops.) So it's actually better to use 2x vshufpd + vaddpd manually on Ryzen, if that data is right. It might not be: his data for Piledriver has 3 uop haddpd xmm,xmm, and it's only 4 uops with a memory operand. It doesn't make sense to me that they couldn't implement hadd as only 3 (or 6 for ymm) uops.
For doing 4 dots with the results packed into one __m256d, the exact problem asked, I think @hirschhornsalz's answer looks very good for Intel CPUs. I haven't studied it super-carefully, but combining in pairs with hadd is good. vperm2f128 is efficient on Intel (but quite bad on AMD: 8 uops on Ryzen with one per 3c throughput).
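For the case of one larger dot product (8 or more doubles) mentioned above, vertical accumulation with several accumulators and a single horizontal sum at the end might look like the sketch below. The function name dot_long, the choice of two accumulators and the AVX_FN macro are assumptions; more accumulators may be needed to fully hide add latency, and the multiply + add pairs could be replaced with FMA where it is available.

```c
#include <immintrin.h>
#include <stddef.h>

/* Assumption: GCC/Clang target attribute so the sketch builds without -mavx.
 * With other compilers, enable AVX through the compiler options instead. */
#if defined(__GNUC__) || defined(__clang__)
#define AVX_FN __attribute__((target("avx")))
#else
#define AVX_FN
#endif

/* Dot product of two arrays of n doubles (any alignment). */
AVX_FN double dot_long(const double *x, const double *y, size_t n)
{
    /* Two independent accumulators so consecutive adds do not depend on each other. */
    __m256d acc0 = _mm256_setzero_pd();
    __m256d acc1 = _mm256_setzero_pd();
    size_t i = 0;

    for (; i + 8 <= n; i += 8) {
        acc0 = _mm256_add_pd(acc0, _mm256_mul_pd(_mm256_loadu_pd(x + i),
                                                 _mm256_loadu_pd(y + i)));
        acc1 = _mm256_add_pd(acc1, _mm256_mul_pd(_mm256_loadu_pd(x + i + 4),
                                                 _mm256_loadu_pd(y + i + 4)));
    }

    /* Horizontal sum, done once at the end. */
    __m256d acc = _mm256_add_pd(acc0, acc1);
    __m128d lo  = _mm_add_pd(_mm256_castpd256_pd128(acc),
                             _mm256_extractf128_pd(acc, 1));
    __m128d high64 = _mm_unpackhi_pd(lo, lo);
    double sum = _mm_cvtsd_f64(_mm_add_sd(lo, high64));

    /* Scalar tail for the remaining 0..7 elements. */
    for (; i < n; ++i)
        sum += x[i] * y[i];
    return sum;
}
```

Note that the summation order differs from a simple scalar loop, so results can differ in the last bits for inputs that are not exactly representable.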