I am trying to convert a loop I have into SSE intrinsics. I seem to have made fairly good progress, by which I mean it's going in the right direction, but I appear to have gotten some of the translation wrong somewhere, because I'm not getting the same "correct" answer that the non-SSE code produces.
My original loop, which I unrolled by a factor of 4, looks like this:
int unroll_n = (N/4)*4;
for (int j = 0; j < unroll_n; j++) {
    for (int i = 0; i < unroll_n; i+=4) {
        float rx = x[j] - x[i];
        float ry = y[j] - y[i];
        float rz = z[j] - z[i];
        float r2 = rx*rx + ry*ry + rz*rz + eps;
        float r2inv = 1.0f / sqrt(r2);
        float r6inv = r2inv * r2inv * r2inv;
        float s = m[j] * r6inv;
        ax[i] += s * rx;
        ay[i] += s * ry;
        az[i] += s * rz;
        //unroll i 2
        rx = x[j] - x[i+1];
        ry = y[j] - y[i+1];
        rz = z[j] - z[i+1];
        r2 = rx*rx + ry*ry + rz*rz + eps;
        r2inv = 1.0f / sqrt(r2);
        r6inv = r2inv * r2inv * r2inv;
        s = m[j] * r6inv;
        ax[i+1] += s * rx;
        ay[i+1] += s * ry;
        az[i+1] += s * rz;
        //unroll i 3
        rx = x[j] - x[i+2];
        ry = y[j] - y[i+2];
        rz = z[j] - z[i+2];
        r2 = rx*rx + ry*ry + rz*rz + eps;
        r2inv = 1.0f / sqrt(r2);
        r6inv = r2inv * r2inv * r2inv;
        s = m[j] * r6inv;
        ax[i+2] += s * rx;
        ay[i+2] += s * ry;
        az[i+2] += s * rz;
        //unroll i 4
        rx = x[j] - x[i+3];
        ry = y[j] - y[i+3];
        rz = z[j] - z[i+3];
        r2 = rx*rx + ry*ry + rz*rz + eps;
        r2inv = 1.0f / sqrt(r2);
        r6inv = r2inv * r2inv * r2inv;
        s = m[j] * r6inv;
        ax[i+3] += s * rx;
        ay[i+3] += s * ry;
        az[i+3] += s * rz;
    }
}
I essentially then went line by line through the top section and converted it into SSE intrinsics. The code is below. I'm not totally sure whether the top three lines are needed, however I understand that my data needs to be 16-byte aligned for this to work correctly and optimally.
float *x = malloc(sizeof(float) * N);
float *y = malloc(sizeof(float) * N);
float *z = malloc(sizeof(float) * N);
for (int j = 0; j < N; j++) {
    for (int i = 0; i < N; i+=4) {
        __m128 xj_v = _mm_set1_ps(x[j]);
        __m128 xi_v = _mm_load_ps(&x[i]);
        __m128 rx_v = _mm_sub_ps(xj_v, xi_v);
        __m128 yj_v = _mm_set1_ps(y[j]);
        __m128 yi_v = _mm_load_ps(&y[i]);
        __m128 ry_v = _mm_sub_ps(yj_v, yi_v);
        __m128 zj_v = _mm_set1_ps(z[j]);
        __m128 zi_v = _mm_load_ps(&z[i]);
        __m128 rz_v = _mm_sub_ps(zj_v, zi_v);
        __m128 r2_v = _mm_mul_ps(rx_v, rx_v) + _mm_mul_ps(ry_v, ry_v) + _mm_mul_ps(rz_v, rz_v) + _mm_set1_ps(eps);
        __m128 r2inv_v = _mm_div_ps(_mm_set1_ps(1.0f), _mm_sqrt_ps(r2_v));
        __m128 r6inv_1v = _mm_mul_ps(r2inv_v, r2inv_v);
        __m128 r6inv_v = _mm_mul_ps(r6inv_1v, r2inv_v);
        __m128 mj_v = _mm_set1_ps(m[j]);
        __m128 s_v = _mm_mul_ps(mj_v, r6inv_v);
        __m128 axi_v = _mm_load_ps(&ax[i]);
        __m128 ayi_v = _mm_load_ps(&ay[i]);
        __m128 azi_v = _mm_load_ps(&az[i]);
        __m128 srx_v = _mm_mul_ps(s_v, rx_v);
        __m128 sry_v = _mm_mul_ps(s_v, ry_v);
        __m128 srz_v = _mm_mul_ps(s_v, rz_v);
        axi_v = _mm_add_ps(axi_v, srx_v);
        ayi_v = _mm_add_ps(ayi_v, srx_v);
        azi_v = _mm_add_ps(azi_v, srx_v);
        _mm_store_ps(ax, axi_v);
        _mm_store_ps(ay, ayi_v);
        _mm_store_ps(az, azi_v);
    }
}
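For the alignment part, I'm guessing I'd need something like _mm_malloc instead of plain malloc to actually guarantee the 16-byte alignment that _mm_load_ps wants; this is just a sketch of what I mean, not tested:

float *x = _mm_malloc(sizeof(float) * N, 16);   // 16-byte aligned; free with _mm_free
float *y = _mm_malloc(sizeof(float) * N, 16);
float *z = _mm_malloc(sizeof(float) * N, 16);
// ... run the loops ...
_mm_free(x);
_mm_free(y);
_mm_free(z);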
I feel the main idea is correct, but there's an error (or several) somewhere, because the resulting answer is incorrect.
Thanks in advance.
I think your only bugs are simple typos, not logic errors; see below.
Can't you just use clang's auto-vectorization? Or do you need to use gcc for this code? Auto-vectorization would let you make SSE, AVX, and (in future) AVX512 versions from the same source with no modifications. Intrinsics aren't scalable to different vector sizes, unfortunately.
Based on your start at vectorizing, I made an optimized version. You should try it out; I'm curious to hear whether it's faster than your version with the bugs fixed, or than clang's auto-vectorized version. :)
This looks wrong:
_mm_store_ps(ax, axi_v);
_mm_store_ps(ay, ayi_v);
_mm_store_ps(az, azi_v);
You loaded from ax[i], but now you're storing to ax[0].
Also, clang's unused-variable warning found this bug:
axi_v = _mm_add_ps(axi_v, srx_v);
ayi_v = _mm_add_ps(ayi_v, srx_v); // should be sry_v
azi_v = _mm_add_ps(azi_v, srx_v); // should be srz_v
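So with both fixes, the tail of your inner loop should read:

axi_v = _mm_add_ps(axi_v, srx_v);
ayi_v = _mm_add_ps(ayi_v, sry_v);
azi_v = _mm_add_ps(azi_v, srz_v);
_mm_store_ps(&ax[i], axi_v);
_mm_store_ps(&ay[i], ayi_v);
_mm_store_ps(&az[i], azi_v);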
Like I said in my answer on your previous question, you should probably interchange the loops, so the same ax[i+0..3], ay[i+0..3], and az[i+0..3] are used, avoiding that load/store.
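In scalar terms, the interchanged structure would look something like this (just a sketch of the loop structure, not the full body):

for (int i = 0; i < N; i++) {
    float axi = 0, ayi = 0, azi = 0;   // running sums for this i
    for (int j = 0; j < N; j++) {
        // ... same body as before, accumulating into axi/ayi/azi ...
    }
    ax[i] += axi;   // one update of ax/ay/az per i instead of per (i,j)
    ay[i] += ayi;
    az[i] += azi;
}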
Also, if you're not going to use rsqrtps + Newton-Raphson, you should use the transformation I pointed out in my last answer: divide m[j] by sqrt(r2)^3. There's no point dividing 1.0f by something with a divps and then multiplying the result only once; fold that multiply into the division by using m[j] as the dividend.
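In scalar terms that means replacing

float r2inv = 1.0f / sqrt(r2);
float r6inv = r2inv * r2inv * r2inv;
float s = m[j] * r6inv;

with something like

float s = m[j] / (r2 * sqrtf(r2));   // r2 * sqrt(r2) == sqrt(r2)^3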
rsqrt might not actually be a win, because total uop throughput might be more of a bottleneck than div/sqrt throughput or latency: three multiplies + FMA + rsqrtps is significantly more uops than sqrtps + divps. rsqrt is more helpful with AVX 256b vectors, because the divide/sqrt unit isn't full-width on SnB through Broadwell. Skylake has 12c latency for sqrtps ymm, same as for xmm, but throughput is still better for xmm (one per 3c instead of one per 6c).
clang and gcc were both using rsqrtps / rsqrtss when compiling your code with -ffast-math (only clang using the packed version, of course).
If you don't interchange the loops, you should manually hoist everything that only depends on j out of the inner loop. Compilers are generally good at this, but it still seems like a good idea to make the source reflect what you expect the compiler to be able to do. This helps with "seeing" what the loop is actually doing.
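For example, keeping your original loop order, the hoisted version would look something like this (sketch):

for (int j = 0; j < N; j++) {
    // everything that only depends on j, broadcast once per outer iteration:
    __m128 xj_v = _mm_set1_ps(x[j]);
    __m128 yj_v = _mm_set1_ps(y[j]);
    __m128 zj_v = _mm_set1_ps(z[j]);
    __m128 mj_v = _mm_set1_ps(m[j]);
    for (int i = 0; i < N; i += 4) {
        // ... the rest of your inner loop, using the hoisted vectors ...
    }
}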
Here's a version with some optimizations over your original:
To get gcc/clang to fuse the mul/adds into FMA, I used -ffp-contract=fast. This gets FMA instructions for high throughput without using -ffast-math. (There is a lot of parallelism with the three separate accumulators, so the increased latency of FMA compared to addps shouldn't hurt at all. I expect port0/1 throughput is the limiting factor here.) I thought gcc did this automatically, but it seems it doesn't here without -ffast-math.
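(If you'd rather not rely on contraction, you could also write the FMAs explicitly with FMA intrinsics, e.g. for the sum of squares; this assumes you compile with -mfma or -march=haswell, so it gives up the baseline-SSE2 fallback:)

__m128 r2_v = _mm_fmadd_ps(rx_v, rx_v, _mm_set1_ps(eps));  // eps + rx*rx
r2_v = _mm_fmadd_ps(ry_v, ry_v, r2_v);                     // ... + ry*ry
r2_v = _mm_fmadd_ps(rz_v, rz_v, r2_v);                     // ... + rz*rz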
Notice that v^(3/2) = sqrt(v)^3 = sqrt(v)*v. This has lower latency and uses fewer instructions.
Interchanged the loops, and used broadcast-loads in the inner loop to improve locality (cutting the bandwidth requirement by 4, or 8 with AVX). Each iteration of the inner loop only reads 4B of new data from each of the four source arrays (x, y, z, and m), so it makes a lot of use of each cache line while it's hot.
Using broadcast-loads in the inner loop also means we accumulate ax[i + 0..3] in parallel, avoiding the need for a horizontal sum, which would take extra code. (See a previous version of this answer for code with the loops interchanged, but using vector loads in the inner loop, with stride = 16B.)
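(For reference, the horizontal sum that this avoids would look something like the following SSE3 sketch, and you'd need one of these per accumulator per i:)

// horizontal sum of the 4 floats in v; needs SSE3 for movehdup
static inline float hsum_ps(__m128 v) {
    __m128 shuf = _mm_movehdup_ps(v);   // [ b, b, d, d ]
    __m128 sums = _mm_add_ps(v, shuf);  // [ a+b, 2b, c+d, 2d ]
    shuf = _mm_movehl_ps(shuf, sums);   // [ c+d, 2d, d, d ]
    sums = _mm_add_ss(sums, shuf);      // a+b+c+d in element 0
    return _mm_cvtss_f32(sums);
}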
It compiles nicely for Haswell with gcc, using FMA. (Still only 128b vector size though, unlike clang's auto-vectorized 256b version). The inner loop is only 20 instructions, and only 13 of those are FPU ALU instructions that need port 0/1 on Intel SnB-family. It makes decent code even with baseline SSE2: no FMA, and needs shufps for the broadcast-loads, but those don't compete for execution units with add/mul.
#include <immintrin.h>

void ffunc_sse128(float *restrict ax, float *restrict ay, float *restrict az,
                  const float *x, const float *y, const float *z,
                  int N, float eps, const float *restrict m)
{
    for (int i = 0; i < N; i += 4) {
        __m128 xi_v = _mm_load_ps(&x[i]);
        __m128 yi_v = _mm_load_ps(&y[i]);
        __m128 zi_v = _mm_load_ps(&z[i]);
        // vector accumulators for ax[i + 0..3] etc.
        __m128 axi_v = _mm_setzero_ps();
        __m128 ayi_v = _mm_setzero_ps();
        __m128 azi_v = _mm_setzero_ps();

        // AVX broadcast-loads are as cheap as normal loads,
        // and data-reuse meant that stand-alone load instructions were used anyway,
        // so we're not even losing out on folding loads into other insns.
        // An inner-loop stride of only 4B is a huge win if memory / cache bandwidth is a bottleneck.
        // Even without AVX, the shufps instructions are cheap,
        // and don't compete with add/mul for execution units on Intel.
        for (int j = 0; j < N; j++) {
            __m128 xj_v = _mm_set1_ps(x[j]);
            __m128 rx_v = _mm_sub_ps(xj_v, xi_v);
            __m128 yj_v = _mm_set1_ps(y[j]);
            __m128 ry_v = _mm_sub_ps(yj_v, yi_v);
            __m128 zj_v = _mm_set1_ps(z[j]);
            __m128 rz_v = _mm_sub_ps(zj_v, zi_v);
            __m128 mj_v = _mm_set1_ps(m[j]);  // do the load early

            // sum of squared differences
            __m128 r2_v = _mm_set1_ps(eps) + rx_v*rx_v + ry_v*ry_v + rz_v*rz_v;  // GNU extension
            /* __m128 r2_v = _mm_add_ps(_mm_set1_ps(eps), _mm_mul_ps(rx_v, rx_v));
               r2_v = _mm_add_ps(r2_v, _mm_mul_ps(ry_v, ry_v));
               r2_v = _mm_add_ps(r2_v, _mm_mul_ps(rz_v, rz_v));
            */

            // rsqrt and a Newton-Raphson iteration might have lower latency,
            // but there are enough surrounding instructions and cross-iteration parallelism
            // that the single-uop sqrtps and divps instructions probably aren't a bottleneck
            __m128 r2sqrt = _mm_sqrt_ps(r2_v);
            __m128 r6sqrt = _mm_mul_ps(r2_v, r2sqrt);  // v^(3/2) = sqrt(v)^3 = sqrt(v)*v
            __m128 s_v = _mm_div_ps(mj_v, r6sqrt);

            __m128 srx_v = _mm_mul_ps(s_v, rx_v);
            __m128 sry_v = _mm_mul_ps(s_v, ry_v);
            __m128 srz_v = _mm_mul_ps(s_v, rz_v);
            axi_v = _mm_add_ps(axi_v, srx_v);
            ayi_v = _mm_add_ps(ayi_v, sry_v);
            azi_v = _mm_add_ps(azi_v, srz_v);
        }
        _mm_store_ps(&ax[i], axi_v);
        _mm_store_ps(&ay[i], ayi_v);
        _mm_store_ps(&az[i], azi_v);
    }
}
I also tried a version with rcpps, but IDK if it will be faster. Note that with -ffast-math, gcc and clang will convert the division into rcpps + a Newton iteration. (For some reason they don't convert 1.0f/sqrtf(x) into rsqrt + Newton, even in a stand-alone function.) clang does a better job, using FMA for the iteration step.
#define USE_RSQRT
#ifndef USE_RSQRT
// even with -mrecip=vec-sqrt after -ffast-math, this still does sqrt(v)*v, then rcpps
__m128 r2sqrt = _mm_sqrt_ps(r2_v);
__m128 r6sqrt = _mm_mul_ps(r2_v, r2sqrt); // v^(3/2) = sqrt(v)^3 = sqrt(v)*v
__m128 s_v = _mm_div_ps(mj_v, r6sqrt);
#else
__m128 r2isqrt = rsqrt_float4_single(r2_v);
// can't use the sqrt(v)*v trick, unless we either do normal sqrt first then rcpps
// or rsqrtps and rcpps. Maybe it's possible to do a Newton-Raphson iteration on that product
// instead of refining them both separately?
__m128 r6isqrt = r2isqrt * r2isqrt * r2isqrt;   // (1/sqrt(r2))^3, GNU extension
__m128 s_v = _mm_mul_ps(mj_v, r6isqrt);
#endif
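(rsqrt_float4_single isn't defined above; I'm assuming it's the usual rsqrtps estimate refined by one Newton-Raphson step, something like this sketch:)

// assumed definition: x1 = x0 * (1.5 - 0.5 * a * x0 * x0), ~23-bit accurate
static inline __m128 rsqrt_float4_single(__m128 a) {
    __m128 x0    = _mm_rsqrt_ps(a);                       // ~12-bit estimate of 1/sqrt(a)
    __m128 ax0x0 = _mm_mul_ps(_mm_mul_ps(a, x0), x0);     // a * x0^2
    __m128 corr  = _mm_sub_ps(_mm_set1_ps(3.0f), ax0x0);  // 3 - a*x0^2
    return _mm_mul_ps(_mm_mul_ps(_mm_set1_ps(0.5f), x0), corr);  // 0.5 * x0 * (3 - a*x0^2)
}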