Handling zeroes in _mm256_rsqrt_ps()

Given that _mm256_sqrt_ps() is relatively slow, and that the values I am generating are immediately truncated with _mm256_floor_ps(), looking around it seems that doing:

_mm256_mul_ps(_mm256_rsqrt_ps(eightFloats),
              eightFloats);

Is the way to go for that extra bit of performance and avoiding a pipeline stall.

Unfortunately, with zero values, I of course get a crash calculating 1/sqrt(0). What is the best way around this? I have tried this (which works and is faster), but is there a better way, or am I going to run into problems under certain conditions?

_mm256_mul_ps(_mm256_rsqrt_ps(_mm256_max_ps(eightFloats,
                                            _mm256_set1_ps(0.1))),
              eightFloats);

My code is for a vertical application, so I can assume that it will be running on a Haswell CPU (i7-4810MQ), so FMA/AVX2 can be used. The original code is approximately:

float vals[MAX];
int sum = 0;
for (int i = 0; i < MAX; i++)
{
    int thisSqrt = (int) floor(sqrt(vals[i]));
    sum += min(thisSqrt, 0x3F);
}

All the values of vals should be integer values. (Why everything isn't just int is a different question...)

tl;dr: See the end for code that compiles and should work.

To just solve the 0.0 problem, you could also special-case inputs of 0.0 with an FP compare of the source against 0.0. Use the compare result as a mask to zero out any NaNs resulting from 0 * +Infinity in sqrt(x) = x * rsqrt(x)). Clang does this when autovectorizing. (But it uses blendps with the zeroed vector, instead of using the compare mask with andnps directly to zero or preserve elements.)

It would also be possible to use sqrt(x) ~= recip(rsqrt(x)), as suggested by njuffa. rsqrt(0) = +Inf. recip(+Inf) = 0. However, using two approximations would compound the relative error, which is a problem.

The thing you're missing:

Truncating to integer (instead of rounding) requires an accurate sqrt result when the input is a perfect square. If the result for 25*rsqrt(25) is 4.999999 or something (instead of 5.00001), you'll add 4 instead of 5.

Even with a Newton-Raphson iteration, rsqrtps isn't perfectly accurate the way sqrtps is, so it might still give 5.0 - 1ulp. (1ulp = one unit in the last place = lowest bit of the mantissa).

Also:

Newton Raphson formula explained
Newton Raphson SSE implementation performance (latency/throughput). Note that we care more about throughput than latency, since we're using it in a loop that doesn't do much else. sqrt isn't part of the loop-carried dep chain, so different iterations can have their sqrt calcs in flight at once.

It might be possible to kill 2 birds with one stone by adding a small constant before doing the (x+offset)*approx_rsqrt(x+offset) and then truncating to integer. Large enough to overcome the max relative error of 1.5*2^-12, but small enough not to bump sqrt_approx(63*63-1+offset) up to 63 (the most sensitive case).

63*1.5*2^(-12)         ==  0.023071...
approx_sqrt(63*63-1)   == 62.99206... +/- 0.023068..

Actually, we're screwed without a Newton iteration even without adding anything. approx_sqrt(63*63-1) could come out above 63.0 all by itself. n=36 is the largest value where the relative error in sqrt(n*n-1) + error is less than sqrt(n*n). GNU Calc:

define f(n) { local x=sqrt(n*n-1); local e=x*1.5*2^(-12); print x; print e, x+e; }
; f(36)
35.98610843089316319413
~0.01317850650545403926 ~35.99928693739861723339
; f(37)
36.9864840178138587015
~0.01354485498699237990 ~37.00002887280085108140

Does your source data have any properties that mean you don't have to worry about it being just below a large perfect square? e.g. is it always perfect squares?

You could check all possible input values, since the important domain is very small (integer FP values from 0..63*63) to see if the error in practice is small enough on Intel Haswell, but that would be a brittle optimization that could make your code break on AMD CPUs, or even on future Intel CPUs. Unfortunately, just coding to the ISA spec's guarantee that the relative error is up to 1.5*2^-12 requires more instructions. I don't see any tricks a NR iteration.

If your upper limit was smaller (like 20), you could just do isqrt = static_cast<int> ((x+0.5)*approx_rsqrt(x+0.5)). You'd get 20 for 20*20, but always 19 for 20*20-1.

; define test_approx_sqrt(x, off) { local s=x*x+off; local sq=s/sqrt(s); local sq_1=(s-1)/sqrt(s-1); local e=1.5*2^(-12); print sq, sq_1; print sq*e, sq_1*e; }
; test_approx_sqrt(20, 0.5)
~20.01249609618950056874 ~19.98749609130668473087   # (x+0.5)/sqrt(x+0.5)
 ~0.00732879495710064718  ~0.00731963968187500662   # relative error

Note that val * (x +/- err) = val*x +/- val*err. IEEE FP mul produces results that are correctly rounded to 0.5ulp, so this should work for FP relative errors.

Anyway, I think you need one Newton-Raphson iteration.

The best bet is to add 0.5 to your input before doing an approx_sqrt using rsqrt. That sidesteps the 0/0 = NaN problem, and pushes the +/- error range all to one side of the whole number cut point (for numbers in the range we care about).

FP min/max instructions have the same performance as FP add, and will be on the critical path either way. Using an add instead of a max also solves the problem of results for perfect squares potentially being a few ulp below the correct result.

Compiler output: a decent starting point

I get pretty good autovectorization results from clang 3.7.1 with sum_int, with -fno-math-errno -funsafe-math-optimizations. -ffinite-math-only is not required (but even with the full -ffast-math, clang avoids sqrt(0) = NaN when using rsqrtps).

sum_fp doesn't auto-vectorize, even with the full -ffast-math.

However clang's version suffers from the same problem as your idea: truncating an inexact result from rsqrt + NR, potentially giving the wrong integer. IDK if this is why gcc doesn't auto-vectorize, because it could have used sqrtps for a big speedup without changing the results. (At least, as long as all the floats are between 0 and INT_MAX², otherwise converting back to integer will give the "indefinite" result of INT_MIN. (sign bit set, all other bits cleared). This is a case where -ffast-math breaks your program, unless you use -mrecip=none or something.

See the asm output on godbolt from:

// autovectorizes with clang, but has rounding problems.
// Note the use of sqrtf, and that floorf before truncating to int is redundant. (removed because clang doesn't optimize away the roundps)
int sum_int(float vals[]){
  int sum = 0;
  for (int i = 0; i < MAX; i++) {
    int thisSqrt = (int) sqrtf(vals[i]);
    sum += std::min(thisSqrt, 0x3F);
  }
  return sum;
}

To manually vectorize with intrinsics, we can look at the asm output from -fno-unroll-loops (to keep things simple). I was going to include this in the answer, but then realized that it had problems.

putting it together:

I think converting to int inside the loop is better than using floorf and then addps. roundps is a 2-uop instruction (6c latency) on Haswell (1uop in SnB/IvB). Worse, both uops require port1, so they compete with FP add / mul. cvttps2dq is a 1-uop instruction for port1, with 3c latency, and then we can use integer min and add to clamp and accumulate, so port5 gets something to do. Using an integer vector accumulator also means the loop-carried dependency chain is 1 cycle, so we don't need to unroll or use multiple accumulators to keep multiple iterations in flight. Smaller code is always better for the big picture (uop cache, L1 I-cache, branch predictors).

As long as we aren't in danger of overflowing 32bit accumulators, this seems to be the best choice. (Without having benchmarked anything or even tested it).

I'm not using the sqrt(x) ~= approx_recip(approx_sqrt(x)) method, because I don't know how to do a Newton iteration to refine it (probably it would involve a division). And because the compounded error is larger.

Horizontal sum from this answer.

Complete but untested version:

#include <immintrin.h>
#define MAX 4096

// 2*sqrt(x) ~= 2*x*approx_rsqrt(x), with a Newton-Raphson iteration
// dividing by 2 is faster in the integer domain, so we don't do it
__m256 approx_2sqrt_ps256(__m256 x) {
    // clang / gcc usually use -3.0 and -0.5.  We could do the same by using fnmsub_ps (add 3 = subtract -3), so we can share constants
    __m256 three = _mm256_set1_ps(3.0f);
    //__m256 half = _mm256_set1_ps(0.5f);  // we omit the *0.5 step

    __m256 nr  = _mm256_rsqrt_ps( x );  // initial approximation for Newton-Raphson
    //           1/sqrt(x) ~=    nr  * (3 - x*nr * nr) * 0.5 = nr*(1.5 - x*0.5*nr*nr)
    // sqrt(x) = x/sqrt(x) ~= (x*nr) * (3 - x*nr * nr) * 0.5
    // 2*sqrt(x) ~= (x*nr) * (3 - x*nr * nr)
    __m256 xnr = _mm256_mul_ps( x, nr );

    __m256 three_minus_muls = _mm256_fnmadd_ps( xnr, nr, three );  // -(xnr*nr) + 3
    return _mm256_mul_ps( xnr, three_minus_muls );
}

// packed int32_t: correct results for inputs from 0 to well above 63*63
__m256i isqrt256_ps(__m256 x) {
    __m256 offset = _mm256_set1_ps(0.5f);    // or subtract -0.5, to maybe share constants with compiler-generated Newton iterations.
    __m256 xoff = _mm256_add_ps(x, offset);  // avoids 0*Inf = NaN, and rounding error before truncation
    __m256 approx_2sqrt_xoff = approx_2sqrt_ps256(xoff);
    __m256i i2sqrtx = _mm256_cvttps_epi32(approx_2sqrt_xoff);
    return _mm256_srli_epi32(i2sqrtx, 1);     // divide by 2 with truncation
    // alternatively, we could mask the low bit to zero and divide by two outside the loop, but that has no advantage unless port0 turns out to be the bottleneck
}

__m256i isqrt256_ps_simple_exact(__m256 x) {
    __m256 sqrt_x = _mm256_sqrt_ps(x);
    __m256i isqrtx = _mm256_cvttps_epi32(sqrt_x);
    return isqrtx;
}

int hsum_epi32_avx(__m256i x256){
    __m128i xhi = _mm256_extracti128_si256(x256, 1);
    __m128i xlo = _mm256_castsi256_si128(x256);
    __m128i x  = _mm_add_epi32(xlo, xhi);
    __m128i hl = _mm_shuffle_epi32(x, _MM_SHUFFLE(1, 0, 3, 2));
    hl  = _mm_add_epi32(hl, x);
    x   = _mm_shuffle_epi32(hl, _MM_SHUFFLE(2, 3, 0, 1));
    hl  = _mm_add_epi32(hl, x);
    return _mm_cvtsi128_si32(hl);
}

int sum_int_avx(float vals[]){
  __m256i sum = _mm256_setzero_si256();
  __m256i upperlimit = _mm256_set1_epi32(0x3F);

  for (int i = 0; i < MAX; i+=8) {
    __m256 v = _mm256_loadu_ps(vals+i);
    __m256i visqrt = isqrt256_ps(v);
    // assert visqrt == isqrt256_ps_simple_exact(v) or something
    visqrt = _mm256_min_epi32(visqrt, upperlimit);
    sum = _mm256_add_epi32(sum, visqrt);
  }
  return hsum_epi32_avx(sum);
}

Compiles on godbolt to nice code, but I haven't tested it. clang makes slightly nicer code that gcc: clang uses broadcast-loads from 4B locations for the set1 constants, instead of repeating them at compile time into 32B constants. gcc also has a bizarre movdqa to copy a register.

Anyway, the whole loop winds up being only 9 vector instructions, compared to 12 for the compiler-generated sum_int version. It probably didn't notice the x*initial_guess(x) common-subexpressions that occur in the Newton-Raphson iteration formula when you're multiplying the result by x, or something like that. It also does an extra mulps instead of a psrld because it does the *0.5 before converting to int. So that's where the extra two mulps instructions come from, and there's the cmpps/blendvps.

sum_int_avx(float*):
    vpxor   ymm3, ymm3, ymm3
    xor     eax, eax
    vbroadcastss    ymm0, dword ptr [rip + .LCPI4_0]  ; set1(0.5)
    vbroadcastss    ymm1, dword ptr [rip + .LCPI4_1]  ; set1(3.0)
    vpbroadcastd    ymm2, dword ptr [rip + .LCPI4_2]  ; set1(63)
LBB4_1:                                               ; latencies
    vaddps      ymm4, ymm0, ymmword ptr [rdi + 4*rax] ; 3c
    vrsqrtps    ymm5, ymm4                            ; 7c
    vmulps      ymm4, ymm4, ymm5   ; x*nr             ; 5c
    vfnmadd213ps ymm5, ymm4, ymm1                     ; 5c
    vmulps      ymm4, ymm4, ymm5                      ; 5c
    vcvttps2dq  ymm4, ymm4                            ; 3c
    vpsrld      ymm4, ymm4, 1      ; 1c this would be a mulps (but not on the critical path) if we did this in the FP domain
    vpminsd     ymm4, ymm4, ymm2                      ; 1c
    vpaddd      ymm3, ymm4, ymm3                      ; 1c
    ; ... (those 9 insns repeated: loop unrolling)
    add     rax, 16
    cmp     rax, 4096
    jl      .LBB4_1
    ;... horizontal sum

IACA thinks that with no unroll, Haswell can sustain a throughput of one iteration per 4.15 cycles, bottlenecking on ports 0 and 1. So potentially you could shave a cycle by accumulating sqrt(x)*2 (with truncation to even numbers using _mm256_and_si256), and only divide by two outside the loop.

Also according to IACA, the latency of a single iteration is 38 cycles on Haswell. I only get 31c, so probably it's including L1 load-use latency or something. Anyway, this means that to saturate the execution units, operations from 8 iterations have to be in flight at once. That's 8 * ~14 unfused-domain uops = 112 unfused-uops (or less with clang's unroll) that have to be in flight at once. Haswell's scheduler is actually only 60 entries, but the ROB is 192 entries. The early uops from early iterations will already have executed, so they only need to be tracked in the ROB, not also in the scheduler. Many of the slow uops are at the beginning of each iteration, though. Still, there's reason to hope that this will come close-ish to saturating ports 0 and 1. Unless data is hot in L1 cache, cache/memory bandwidth will probably be the bottleneck.

Interleaving operations from multiple dep chains would also be better. When clang unrolls, it puts all 9 instructions for one iteration ahead of all 9 instructions for another iteration. It uses a surprisingly small number of registers, so it would be possible to have instructions for 2 or 4 iterations mixed together. This is the sort of thing compilers are supposed to be good at, but which is cumbersome for humans. :/

It would also be slightly more efficient if the compiler chose a one-register addressing mode, so the load could micro-fuse with the vaddps. gcc does this.