可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效,请关闭广告屏蔽插件后再试):
问题:
In various contexts, for example for the argument reduction for mathematical functions, one needs to compute (a - K) / (a + K)
, where a
is a positive variable argument and K
is a constant. In many cases, K
is a power of two, which is the use case relevant to my work. I am looking for efficient ways to compute this quotient more accurately than can be accomplished with the straightforward division. Hardware support for fused multiply-add (FMA) can be assumed, as this operation is provided by all major CPU and GPU architectures at this time, and is available in C/C++ via the functionsfma()
and fmaf()
.
For ease of exploration, I am experimenting with float
arithmetic. Since I plan to port the approach to double
arithmetic as well, no operations using higher than the native precision of both argument and result may be used. My best solution so far is:
/* Compute q = (a - K) / (a + K) with improved accuracy. Variant 1 */
m = a - K;
p = a + K;
r = 1.0f / p;
q = m * r;
t = fmaf (q, -2.0f*K, m);
e = fmaf (q, -m, t);
q = fmaf (r, e, q);
For arguments a
in the interval [K/2, 4.23*K]
, code above computes the quotient almost correctly rounded for all inputs (maximum error is exceedingly close to 0.5 ulps), provided that K
is a power of 2, and there is no overflow or underflow in intermediate results. For K
not a power of two, this code is still more accurate than the naive algorithm based on division. In terms of performance, this code can be faster than the naive approach on platforms where the floating-point reciprocal can be computed faster than the floating-point division.
I make the following observation when K
= 2n: When the upper bound of the work interval increases to 8*K
, 16*K
, ... maximum error increases gradually and starts to slowly approximate the maximum error of the naive computation from below. Unfortunately, the same does not appear to be true for the lower bound of the interval. If the lower bound drops to 0.25*K
, the maximum error of the improved method above equals the maximum error of the naive method.
Is there a method to compute q = (a - K) / (a + K) that can achieve smaller maximum error (measured in ulp vs the mathematical result) compared to both the naive method and the above code sequence, over a wider interval, in particular for intervals whose lower bound is less than 0.5*K
? Efficiency is important, but a few more operations than are used in the above code can likely be tolerated.
In one answer below, it was pointed out that I could enhance accuracy by returning the quotient as an unevaluated sum of two operands, that is, as a head-tail pair q:qlo
, i.e. similar to the well-known double-float
and double-double
formats. In my code above, this would mean changing the last line to qlo = r * e
.
This approach is certainly useful, and I had already contemplated its use for an extended-precision logarithm for use in pow()
. But it doesn't fundamentally help with the desired widening of the interval on which the enhanced computation provides more accurate quotients. In a particular case I am looking at, I would like to use K=2
(for single precision) or K=4
(for double precision) to keep the primary approximation interval narrow, and the interval for a
is roughly [0,28]. The practical problem I am facing is that for arguments < 0.25*K the accuracy of the improved division is not substantially better than with the naive method.
回答1:
If a is large compared to K, then (a-K)/(a+K) = 1 - 2K / (a + K) will give a good approximation. If a is small compared to K, then 2a / (a + K) - 1 will give a good approximation. If K/2 ≤ a ≤ 2K, then a-K is an exact operation, so doing the division will give a decent result.
回答2:
One possibility is to track error of m and p into m1 and p1 with classical Dekker/Schewchuk:
m=a-k;
k0=a-m;
a0=k0+m;
k1=k0-k;
a1=a-a0;
m1=a1+k1;
p=a+k;
k0=p-a;
a0=p-k0;
k1=k-k0;
a1=a-a0;
p1=a1+k1;
Then, correct the naive division:
q=m/p;
r0=fmaf(p,-q,m);
r1=fmaf(p1,-q,m1);
r=r0+r1;
q1=r/p;
q=q+q1;
That'll cost you 2 divisions, but should be near half ulp if I didn't screw up.
But these divisions can be replaced by multiplications with inverse of p without any problem, since the first incorrectly rounded division will be compensated by remainder r, and second incorrectly rounded division does not really matter (the last bits of correction q1 won't change anything).
回答3:
I don't really have an answer (proper floating point error analyses are very tedious) but a few observations:
- Fast reciprocal instructions (such as RCPSS) are not as accurate as division, so you may see a reduction in accuracy if using these.
m
is computed exactly if a ∈ [0.5×Kb, 21+n×Kb), where Kb is the power of 2 below K (or K itself if K is a power of 2), and n is the number of trailing zeros in the significand of K (i.e. if K is a power of 2, then n=23).
- This is similar to a simplified form of the
div2
algorithm from Dekker (1971): to expand the range (particularly the lower bound), you'll probably have to incorporate more correction terms from this (i.e. store m
as the sum of 2 float
s, or use a double
).
回答4:
Since my goal is to merely widen the interval on which accurate results are achieved, rather than to find a solution that works for all possible values of a
, making use of double-float
arithmetic for all intermediate computation seems too costly.
Thinking some more about the problem, it is clear that the computation of the remainder of the division, e
in the code from my question, is the crucial part of achieving more accurate result. Mathematically, the remainder is (a-K) - q * (a+K). In my code, I simply used m
to represent (a-K) and represented (a+k) as m + 2*K
, as this delivers numerically superior results to the straightforward representation.
With relatively small additional computational cost, (a+K) can be represented as a double-float
, that is, a head-tail pair p:plo
, which leads to the following modified version of my original code:
/* Compute q = (a - K) / (a + K) with improved accuracy. Variant 2 */
m = a - K;
p = a + K;
r = 1.0f / p;
q = m * r;
mx = fmaxf (a, K);
mn = fminf (a, K);
plo = (mx - p) + mn;
t = fmaf (q, -p, m);
e = fmaf (q, -plo, t);
q = fmaf (r, e, q);
Testing shows that this delivers nearly correctly rounded results for a
in [K/2, 224*K), allowing for a substantial increase to the upper bound of the interval on which accurate results are achieved.
Widening the interval at the lower end requires the more accurate representation of (a-K). We can compute this as a double-float
head-tail pair m:mlo
, which leads to the following code variant:
/* Compute q = (a - K) / (a + K) with improved accuracy. Variant 3 */
m = a - K;
p = a + K;
r = 1.0f / p;
q = m * r;
plo = (a < K) ? ((K - p) + a) : ((a - p) + K);
mlo = (a < K) ? (a - (K + m)) : ((a - m) - K);
t = fmaf (q, -p, m);
e = fmaf (q, -plo, t);
e = e + mlo;
q = fmaf (r, e, q);
Exhaustive testing hows that this delivers nearly correctly rounded results for a
in the interval [K/224, K*224). Unfortunately, this comes at a cost of ten additional operations compared to the code in my question, which is a steep price to pay to get the maximum error from around 1.625 ulps with the naive computation down to near 0.5 ulp.
As in my original code from the question, one can express (a+K) in terms of (a-K), thus eliminating the computation of the tail of p
, plo
. This approach results in the following code:
/* Compute q = (a - K) / (a + K) with improved accuracy. Variant 4 */
m = a - K;
p = a + K;
r = 1.0f / p;
q = m * r;
mlo = (a < K) ? (a - (K + m)) : ((a - m) - K);
t = fmaf (q, -2.0f*K, m);
t = fmaf (q, -m, t);
e = fmaf (q - 1.0f, -mlo, t);
q = fmaf (r, e, q);
This turns out to be advantageous if the main focus is decreasing the lower limit of the interval, which is my particular focus as explained in the question. Exhaustive testing of the single-precision case shows that when K=2n nearly correctly rounded results are produced for values of a
in the interval [K/224, 4.23*K]. With a total of 14 or 15 operations (depending on whether an architecture supports full predication or just conditional moves), this requires seven to eight more operations than my original code.
Lastly, one might base the residual computation directly on the original variable a
to avoid the error inherent in the computation of m
and p
. This leads to the following code that, for K = 2n, computes nearly correctly rounded results for a
in the interval [K/224, K/3):
/* Compute q = (a - K) / (a + K) with improved accuracy. Variant 5 */
m = a - K;
p = a + K;
r = 1.0f / p;
q = m * r;
t = fmaf (q + 1.0f, -K, a);
e = fmaf (q, -a, t);
q = fmaf (r, e, q);
回答5:
If you can relax the API to return another variable that models the error, then the solution becomes much simpler:
float foo(float a, float k, float *res)
{
float ret=(a-k)/(a+k);
*res = fmaf(-ret,a+k,a-k)/(a+k);
return ret;
}
This solution only handles truncation error of division, but does not handle the loss of precision of a+k
and a-k
.
To handle those errors, I think I need to use double precision, or bithack to use fixed point.
Test code is updated to artificially generate non zero least significant bits
in the input
test code
https://ideone.com/bHxAg8
回答6:
The problem is the addition in (a + K)
. Any loss of precision in (a + K)
is magnified by the division. The problem isn't the division itself.
If the exponents of a
and K
are the same (almost) no precision is lost, and if the absolute difference between the exponents is greater than the significand size then either (a + K) == a
(if a
has larger magnitude) or (a + K) == K
(if K
has larger magnitude).
There is no way to prevent this. Increasing the significand size (e.g. using 80-bit "extended double" on 80x86) only helps widen the "accurate result range" slightly. To understand why, consider smallest + largest
(where smallest
is the smallest positive denormal a 32-bit floating point number can be). In this case (for 32-bit floats) you'd need a significand size of about 260 bits for the result to avoid precision loss completely. Doing (e.g.) temp = 1/(a + K); result = a * temp - K / temp;
won't help much either because you've still got exactly the same (a + K)
problem (but it would avoid a similar problem in (a - K)
). Also you can't do result = anything / p + anything_error/p_error
because division doesn't work like that.
There are only 3 alternatives I can think of to get close to 0.5 ulps for all possible positive values of a
that can fit in 32-bit floating point. None are likely to be acceptable.
The first alternative involves pre-computing a lookup table (using "big real number" maths) for every value of a
, which (with some tricks) ends up being about 2 GiB for 32-bit floating point (and completely insane for 64-bit floating point). Of course if the range of possible values of a
is smaller than "any positive value that can fit in a 32-bit float" the size of the lookup table would be reduced.
The second alternative is to use something else ("big real number") for the calculation at run-time (and convert to/from 32-bit floating point).
The third alternative involves, "something" (I don't know what it's called, but it's expensive). Set the rounding mode to "round to positive infinity" and calculate temp1 = (a + K); if(a < K) temp2 = (a - K);
then switch to "round to negative infinity" and calculate if(a >= K) temp2 = (a - K); lower_bound = temp2 / temp1;
. Next do a_lower = a
and decrease a_lower
by the smallest amount possible and repeat the "lower_bound" calculation, and keep doing that until you get a different value for lower_bound
, then revert back to the previous value of a_lower
. After that you do essentially the same (but opposite rounding modes, and incrementing not decrementing) to determine upper_bound
and a_upper
(starting with the original value of a
). Finally, interpolate, like a_range = a_upper - a_lower; result = upper_bound * (a_upper - a) / a_range + lower_bound * (a - a_lower) / a_range;
. Note that you will want to calculate an initial upper and lower bound and skip all of this if they're equal. Also be warned that this is all "in theory, completely untested" and I probably borked it somewhere.
Mainly what I'm saying is that (in my opinion) you should give up and accept that there's nothing that you can do to get close to 0.5 ulp. Sorry.. :)