Vectorizable implementation of the complementary error function erfcf()

Published 2020-03-02 03:32

Question:

The complementary error function, erfc, is a special function closely related to the standard normal distribution. It is frequently used in statistics and the natural sciences (e.g. diffusion problems) where the "tails" of this distribution need to be considered, and where use of the error function, erf, is therefore not suitable.

The complementary error function was made available in the ISO C99 standard math library as the functions erfcf, erfc, and erfcl; these were subsequently adopted into ISO C++ as well. Thus source code can readily be found in open-source implementations of that library, for example in glibc.

However, many existing implementations are scalar in nature, while modern processor hardware is SIMD-oriented (either explicitly, as in x86 CPUs, or implicitly, as in GPUs). For performance reasons, a vectorizable implementation is therefore highly desirable. This means branches need to be avoided, except as part of select assignment. Likewise, extensive use of tables is not indicated, as parallelized lookup is often inefficient.

How would one go about constructing an efficient vectorizable implementation of the single-precision function erfcf()? The accuracy, as measured in ulp, should be roughly the same as glibc's scalar implementation, which has a maximum error of 3.12575 ulps (determined by exhaustive testing). The availability of fused multiply-add (FMA) can be assumed, as all major processor architectures (CPUs and GPUs) offer it at this time. While handling of floating-point status flags and errno can be ignored, denormals, infinities, and NaNs should be handled in accordance with the IEEE 754 bindings for ISO C.

Answer 1:

After looking into various approaches, the one that seems most suitable is the algorithm proposed in the following paper:

M. M. Shepherd and J. G. Laframboise, "Chebyshev Approximation of (1 + 2x) exp(x²) erfc x in 0 ≤ x < ∞." Mathematics of Computation, Volume 36, No. 153, January 1981, pp. 249-253 (online copy)

The basic idea of the paper is to create an approximation to (1 + 2x) exp(x²) erfc(x), from which we can compute erfcx(x) by simply dividing by (1 + 2x), and erfc(x) by then multiplying with exp(-x²). The tightly bounded range of the function, with function values roughly in [1, 1.3], and its general "flatness" lend themselves well to polynomial approximation. Numerical properties of this approach are further improved by narrowing the approximation interval: the original argument x is transformed by q = (x - K) / (x + K), where K is a suitably chosen constant, followed by computing p(q), where p is a polynomial.

Since erfc(-x) = 2 - erfc(x), we only need to consider the interval [0, ∞), which is mapped to the interval [-1, 1) by this transformation. For IEEE-754 single precision, erfcf() vanishes (becomes zero) for x > 10.0546875, so one needs to consider only x ∈ [0, 10.0546875]. What is the "optimal" value of K for this range? I know of no mathematical analysis that would provide the answer; the paper suggests K = 3.75 based on experiments.

One can readily establish that for single-precision computation, a minimax polynomial approximation of degree 9 is sufficient for various values of K in that general vicinity. Systematically generating such approximations with the Remez algorithm, with K varying between 1.5 and 4 in steps of 1/16, the lowest approximation error is observed for K = {2, 2.625, 3.3125}. Of these, K = 2 is the most advantageous choice, since it lends itself to very accurate computation of (x - K) / (x + K), as shown in this question.
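
To illustrate the advantage of K = 2: because multiplication by 2 is exact in binary floating point, the residual (a - 2) - q * (a + 2) can be rearranged as (a - 2 * (q + 1)) - q * a and captured with just two FMAs. The fragment below shows the resulting refinement; these are the same steps that appear in the full listing further down, annotated in more detail:

m = a - 2.0f;                  /* numerator; exact for a in [1, 4] by Sterbenz's lemma */
p = a + 2.0f;                  /* denominator */
r = 1.0f / p;                  /* reciprocal of the denominator */
q = m * r;                     /* initial quotient */
t = fmaf (q + 1.0f, -2.0f, a); /* a - 2*(q+1); the product by -2 is exact */
e = fmaf (q, -a, t);           /* residual (a-2) - q*(a+2), to high accuracy */
q = fmaf (r, e, q);            /* refined quotient: q + e/(a+2) */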

The value K = 2 and the input domain for x would suggest that it is necessary to use variant 4 from my answer, however one can demonstrate experimentally that the less expensive variant 5 achieves the same accuracy here. This is likely due to the very shallow slope of the approximated function for q > -0.5, which causes any error in the argument q to be reduced by roughly a factor of ten.

Since the computation of erfc() requires post-processing steps in addition to the initial approximation, both the approximation itself and the post-processing must be performed with high accuracy to achieve a sufficiently accurate final result. Error-correcting techniques must be used.

One observes that the most significant coefficient in the polynomial approximation of (1 + 2x) exp(x²) erfc(x) is of the form (1 + s), where s < 0.5. This means we can represent the leading coefficient more accurately by splitting off 1, and only using s in the polynomial. So instead of computing a polynomial p(q), then multiplying by the reciprocal r = 1 / (1 + 2x), it is mathematically equivalent but numerically advantageous to compute the core approximation as p(q) + 1, and use p to compute fma (p, r, r).

The accuracy of the division can be enhanced by computing an initial quotient q from the reciprocal r, computing the residual e = (p + 1) - q * (1 + 2x) with the help of an FMA, then using e to apply the correction q = q + (e * r), again using an FMA.
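
Expressed in code, the split-off leading coefficient and the quotient correction combine into the following sequence (the same steps used in the listing below); note how fmaf (p, r, r) computes (p + 1) * r without ever forming p + 1 explicitly:

t = a + a;                        /* 2*a, exact */
d = t + 1.0f;                     /* denominator 1 + 2*a */
r = 1.0f / d;                     /* reciprocal of the denominator */
q = fmaf (p, r, r);               /* initial quotient (p+1)/(1+2*a) */
e = (p - q) + fmaf (q, -t, 1.0f); /* residual (p+1) - q*(1+2*a) */
r = fmaf (e, r, q);               /* corrected quotient q + e*r */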

Exponentiation has error-magnifying properties, therefore the computation of exp(-x²) must be performed carefully. The availability of FMA trivially allows the computation of -x² as a double-float s_high + s_low. The exponential function is its own derivative, so one can compute exp(s_high + s_low) as exp(s_high) + exp(s_high) * s_low. This computation can be combined with the multiplication by the previous intermediate result r to yield r = r * exp(s_high) + r * exp(s_high) * s_low. By use of FMA, one ensures that the most significant term r * exp(s_high) is computed as accurately as possible.
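
Concretely, with s = fl(a * a), an FMA recovers the rounding error t = s - a * a exactly, so that -a² = -s + t and exp(-a²) = exp(-s) * exp(t) ≈ exp(-s) * (1 + t). The corresponding steps from the listing below are:

s = a * a;                  /* head of a*a */
e = my_expf (-s);           /* exp(-s) */
t = fmaf (a, -a, s);        /* tail: rounding error s - a*a, exact via FMA */
r = fmaf (r, e, r * e * t); /* r*exp(-a*a) = r*e + r*e*t; head term via FMA */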

Combining the steps above with a few simple selections to handle exceptional cases and negative arguments, one arrives at the following C code:

#include <math.h>

float my_expf (float);

/*  
 * Based on: M. M. Shepherd and J. G. Laframboise, "Chebyshev Approximation of 
 * (1+2x)exp(x^2)erfc x in 0 <= x < INF", Mathematics of Computation, Vol. 36,
 * No. 153, January 1981, pp. 249-253.  
 */  
float my_erfcf (float x)
{
    float a, d, e, m, p, q, r, s, t;

    a = fabsf (x); 

    /* Compute q = (a-2)/(a+2) accurately. [0, 10.0546875] -> [-1, 0.66818] */
    m = a - 2.0f;
    p = a + 2.0f;
    r = 1.0f / p;
    q = m * r;
    t = fmaf (q + 1.0f, -2.0f, a); 
    e = fmaf (q, -a, t); 
    q = fmaf (r, e, q); 

    /* Approximate (1+2*a)*exp(a*a)*erfc(a) as p(q)+1 for q in [-1, 0.66818] */
    p =             -0x1.a48024p-12f;  // -4.01020574e-4
    p = fmaf (p, q, -0x1.42a172p-10f); // -1.23073824e-3
    p = fmaf (p, q,  0x1.585784p-10f); //  1.31355994e-3
    p = fmaf (p, q,  0x1.1ade24p-07f); //  8.63243826e-3
    p = fmaf (p, q, -0x1.081b72p-07f); // -8.05991236e-3
    p = fmaf (p, q, -0x1.bc0b94p-05f); // -5.42047396e-2
    p = fmaf (p, q,  0x1.4ffc40p-03f); //  1.64055347e-1
    p = fmaf (p, q, -0x1.540840p-03f); // -1.66031361e-1
    p = fmaf (p, q, -0x1.7bf612p-04f); // -9.27639678e-2
    p = fmaf (p, q,  0x1.1ba03ap-02f); //  2.76978403e-1

    /* Divide (1+p) by (1+2*a) ==> exp(a*a)*erfc(a) */
    t = a + a;
    d = t + 1.0f;
    r = 1.0f / d;
    q = fmaf (p, r, r); // q = (p+1)/(1+2*a)
    e = (p - q) + fmaf (q, -t, 1.0f); // (p+1) - q*(1+2*a)
    r = fmaf (e, r, q);

    /* Multiply by exp(-a*a) ==> erfc(a) */
    s = a * a; 
    e = my_expf (-s);  
    t = fmaf (a, -a, s);
    r = fmaf (r, e, r * e * t);

    /* Handle NaN arguments to erfc() */
    if (!(a <= 0x1.fffffep127f)) r = x + x;

    /* Clamp result for large arguments */
    if (a > 10.0546875f) r = 0.0f;

    /* Handle negative arguments to erfc() */
    if (x < 0.0f) r = 2.0f - r; 

    return r;
}

/* Compute exponential base e. Maximum ulp error = 0.87161 */
float my_expf (float a)
{
    float c, f, r;
    int i;

    // exp(a) = 2**i * exp(f); i = rint (a / log(2)); f = a - i * log(2)
    c = 0x1.800000p+23f; // 1.25829120e+7
    r = fmaf (0x1.715476p+0f, a, c) - c; // 1.44269502e+0
    f = fmaf (r, -0x1.62e400p-01f, a); // -6.93145752e-1 // log_2_hi 
    f = fmaf (r, -0x1.7f7d1cp-20f, f); // -1.42860677e-6 // log_2_lo
    i = (int)r;
    // approximate r = exp(f) on interval [-log(2)/2,+log(2)/2]
    r =             0x1.6a98dap-10f;  // 1.38319808e-3
    r = fmaf (r, f, 0x1.1272cap-07f); // 8.37550033e-3
    r = fmaf (r, f, 0x1.555a20p-05f); // 4.16689515e-2
    r = fmaf (r, f, 0x1.55542ep-03f); // 1.66664466e-1
    r = fmaf (r, f, 0x1.fffff6p-02f); // 4.99999851e-1
    r = fmaf (r, f, 0x1.000000p+00f); // 1.00000000e+0
    r = fmaf (r, f, 0x1.000000p+00f); // 1.00000000e+0
    // exp(a) = 2**i * exp(f);
    r = ldexpf (r, i);
    // handle special cases
    if (!(fabsf (a) < 104.0f)) {
        r = a + a; // handle NaNs
        if (a < 0.0f) r = 0.0f;
        if (a > 0.0f) r = 1e38f * 1e38f; // + INF
    }
    return r;
}
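
For a quick plausibility check, one can compare against the erfcf() from the platform's math library at a few sample points. The harness below is only a sketch and assumes it is compiled together with the code above; it spot-checks values, whereas the 2.65712-ulp bound quoted below was established by exhaustive test:

#include <stdio.h>
#include <math.h>

float my_erfcf (float);

int main (void)
{
    /* sample arguments: negative, zero, moderate, near cutoff, infinity, NaN */
    float xs[] = {-4.0f, -0.5f, 0.0f, 0.5f, 1.0f, 4.0f, 10.0f, 10.5f,
                  INFINITY, -INFINITY, NAN};
    for (int i = 0; i < (int)(sizeof xs / sizeof xs[0]); i++) {
        printf ("x=%-10g  my_erfcf=%15.8e  erfcf=%15.8e\n",
                xs[i], my_erfcf (xs[i]), erfcf (xs[i]));
    }
    return 0;
}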

I used my own implementation of expf() in the above code to isolate my work from differences between the expf() implementations on different compute platforms, but any implementation of expf() whose maximum error is close to 0.5 ulp should work well. As shown above, i.e. when using my_expf(), my_erfcf() has a maximum error of 2.65712 ulps. Provided a vectorizable expf() is available, the code above should vectorize without problem. I did a quick check with the Intel compiler 13.1.3.198: I put a call to my_erfcf() in a loop, added #include <mathimf.h>, replaced the call to my_expf() with a call to expf(), then compiled using these command-line switches:

/Qstd=c99 /O3 /QxCORE-AVX2 /fp:precise /Qfma /Qimf-precision:high:expf /Qvec_report=2

The Intel compiler reported that the loop had been vectorized, which I double checked by inspection of the disassembled binary code.

Since my_erfcf() only uses reciprocals rather than full divisions, it is amenable to the use of fast reciprocal implementations, provided they deliver almost correctly rounded results. For processors that provide a fast single-precision reciprocal approximation in hardware, this can easily be achieved by refining that approximation with a Halley iteration with cubic convergence. A (scalar) example of this approach for x86 processors is:

#include <immintrin.h>
#include <math.h>

/* Compute 1.0f / a almost correctly rounded. Halley iteration with cubic convergence */
float fast_recipf (float a)
{
    __m128 t;
    float e, r;
    t = _mm_set_ss (a);
    t = _mm_rcp_ss (t);
    _mm_store_ss (&r, t);
    e = fmaf (r, -a, 1.0f);
    e = fmaf (e, e, e);
    r = fmaf (e, r, r);
    return r;
}
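
To see why this iteration converges cubically, write the error of the hardware approximation as e = 1 - a * r, so that r = (1 - e) / a. The update computes r * (1 + e + e²) = (1 - e³) / a, i.e. a relative error of e is reduced to about e³ in a single step. The second FMA, e = fmaf (e, e, e), forms e + e² with a single rounding, and the final FMA applies the correction to r.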