I am trying to multiply two vectors together where each element of one vector is multiplied by the element in the same index at the other vector. I then want to sum all the elements of the resulting vector to obtain one number. For instance, the calculation would look like this for the vectors {1,2,3,4} and {5,6,7,8}:
1*5+2*6+3*7+4*8
Essentially, I am taking the dot product of the two vectors. I know there is an SSE command to do this, but the command doesn't have an intrinsic function associated with it. At this point, I don't want to write inline assembly in my C code, so I want to use only intrinsic functions. This seems like a common calculation so I am surprised by myself that I couldn't find the answer on Google.
Note: I am optimizing for a specific micro architecture which supports up to SSE 4.2.
Thanks for your help.
GCC (at least version 4.3) includes <smmintrin.h>
with SSE4.1 level intrinsics, including the single and double-precision dot products:
_mm_dp_ps (__m128 __X, __m128 __Y, const int __M);
_mm_dp_pd (__m128d __X, __m128d __Y, const int __M);
As a fallback for older processors, you can use this algorithm to create the dot product of the vectors a
and b
:
r1 = _mm_mul_ps(a, b);
r2 = _mm_hadd_ps(r1, r1);
r3 = _mm_hadd_ps(r2, r2);
_mm_store_ss(&result, r3);
There is an article by Intel here which touches on dot-product implementations.
I wrote this and compiled it with gcc -O3 -S -ftree-vectorize -ftree-vectorizer-verbose=2 sse.c
void f(int * __restrict__ a, int * __restrict__ b, int * __restrict__ c, int * __restrict__ d,
int * __restrict__ e, int * __restrict__ f, int * __restrict__ g, int * __restrict__ h,
int * __restrict__ o)
{
int i;
for (i = 0; i < 8; ++i)
o[i] = a[i]*e[i] + b[i]*f[i] + c[i]*g[i] + d[i]*h[i];
}
And GCC 4.3.0 auto-vectorized it:
sse.c:5: note: LOOP VECTORIZED.
sse.c:2: note: vectorized 1 loops in function.
However, it would only do that if I used a loop with enough iterations -- otherwise the verbose output would clarify that vectorization was unprofitable or the loop was too small. Without the __restrict__
keywords it has to generate separate, non-vectorized versions to deal with cases where the output o
may point into one of the inputs.
I would paste the instructions as an example, but since part of the vectorization unrolled the loop it's not very readable.
I'd say the fastest SSE method would be:
static inline float CalcDotProductSse(__m128 x, __m128 y) {
__m128 mulRes, shufReg, sumsReg;
mulRes = _mm_mul_ps(x, y);
// Calculates the sum of SSE Register - https://stackoverflow.com/a/35270026/195787
shufReg = _mm_movehdup_ps(mulRes); // Broadcast elements 3,1 to 2,0
sumsReg = _mm_add_ps(mulRes, shufReg);
shufReg = _mm_movehl_ps(shufReg, sumsReg); // High Half -> Low Half
sumsReg = _mm_add_ss(sumsReg, shufReg);
return _mm_cvtss_f32(sumsReg); // Result in the lower part of the SSE Register
}
I followed - Fastest Way to Do Horizontal Float Vector Sum On x86.