In various contexts, for example for the argument reduction for mathematical functions, one needs to compute (a - K) / (a + K)
, where a
is a positive variable argument and K
is a constant. In many cases, K
is a power of two, which is the use case relevant to my work. I am looking for efficient ways to compute this quotient more accurately than can be accomplished with the straightforward division. Hardware support for fused multiply-add (FMA) can be assumed, as this operation is provided by all major CPU and GPU architectures at this time, and is available in C/C++ via the functionsfma()
and fmaf()
.
For ease of exploration, I am experimenting with float
arithmetic. Since I plan to port the approach to double
arithmetic as well, no operations using higher than the native precision of both argument and result may be used. My best solution so far is:
/* Compute q = (a - K) / (a + K) with improved accuracy. Variant 1 */
m = a - K;
p = a + K;
r = 1.0f / p;
q = m * r;
t = fmaf (q, -2.0f*K, m);
e = fmaf (q, -m, t);
q = fmaf (r, e, q);
For arguments a
in the interval [K/2, 4.23*K]
, code above computes the quotient almost correctly rounded for all inputs (maximum error is exceedingly close to 0.5 ulps), provided that K
is a power of 2, and there is no overflow or underflow in intermediate results. For K
not a power of two, this code is still more accurate than the naive algorithm based on division. In terms of performance, this code can be faster than the naive approach on platforms where the floating-point reciprocal can be computed faster than the floating-point division.
I make the following observation when K
= 2n: When the upper bound of the work interval increases to 8*K
, 16*K
, ... maximum error increases gradually and starts to slowly approximate the maximum error of the naive computation from below. Unfortunately, the same does not appear to be true for the lower bound of the interval. If the lower bound drops to 0.25*K
, the maximum error of the improved method above equals the maximum error of the naive method.
Is there a method to compute q = (a - K) / (a + K) that can achieve smaller maximum error (measured in ulp vs the mathematical result) compared to both the naive method and the above code sequence, over a wider interval, in particular for intervals whose lower bound is less than 0.5*K
? Efficiency is important, but a few more operations than are used in the above code can likely be tolerated.
In one answer below, it was pointed out that I could enhance accuracy by returning the quotient as an unevaluated sum of two operands, that is, as a head-tail pair q:qlo
, i.e. similar to the well-known double-float
and double-double
formats. In my code above, this would mean changing the last line to qlo = r * e
.
This approach is certainly useful, and I had already contemplated its use for an extended-precision logarithm for use in pow()
. But it doesn't fundamentally help with the desired widening of the interval on which the enhanced computation provides more accurate quotients. In a particular case I am looking at, I would like to use K=2
(for single precision) or K=4
(for double precision) to keep the primary approximation interval narrow, and the interval for a
is roughly [0,28]. The practical problem I am facing is that for arguments < 0.25*K the accuracy of the improved division is not substantially better than with the naive method.
One possibility is to track error of m and p into m1 and p1 with classical Dekker/Schewchuk:
Then, correct the naive division:
That'll cost you 2 divisions, but should be near half ulp if I didn't screw up.
But these divisions can be replaced by multiplications with inverse of p without any problem, since the first incorrectly rounded division will be compensated by remainder r, and second incorrectly rounded division does not really matter (the last bits of correction q1 won't change anything).
If a is large compared to K, then (a-K)/(a+K) = 1 - 2K / (a + K) will give a good approximation. If a is small compared to K, then 2a / (a + K) - 1 will give a good approximation. If K/2 ≤ a ≤ 2K, then a-K is an exact operation, so doing the division will give a decent result.
If you can relax the API to return another variable that models the error, then the solution becomes much simpler:
This solution only handles truncation error of division, but does not handle the loss of precision of
a+k
anda-k
.To handle those errors, I think I need to use double precision, or bithack to use fixed point.
Test code is updated to artificially generate non zero least significant bits in the input
test code
https://ideone.com/bHxAg8
I don't really have an answer (proper floating point error analyses are very tedious) but a few observations:
m
is computed exactly if a ∈ [0.5×Kb, 21+n×Kb), where Kb is the power of 2 below K (or K itself if K is a power of 2), and n is the number of trailing zeros in the significand of K (i.e. if K is a power of 2, then n=23).div2
algorithm from Dekker (1971): to expand the range (particularly the lower bound), you'll probably have to incorporate more correction terms from this (i.e. storem
as the sum of 2float
s, or use adouble
).The problem is the addition in
(a + K)
. Any loss of precision in(a + K)
is magnified by the division. The problem isn't the division itself.If the exponents of
a
andK
are the same (almost) no precision is lost, and if the absolute difference between the exponents is greater than the significand size then either(a + K) == a
(ifa
has larger magnitude) or(a + K) == K
(ifK
has larger magnitude).There is no way to prevent this. Increasing the significand size (e.g. using 80-bit "extended double" on 80x86) only helps widen the "accurate result range" slightly. To understand why, consider
smallest + largest
(wheresmallest
is the smallest positive denormal a 32-bit floating point number can be). In this case (for 32-bit floats) you'd need a significand size of about 260 bits for the result to avoid precision loss completely. Doing (e.g.)temp = 1/(a + K); result = a * temp - K / temp;
won't help much either because you've still got exactly the same(a + K)
problem (but it would avoid a similar problem in(a - K)
). Also you can't doresult = anything / p + anything_error/p_error
because division doesn't work like that.There are only 3 alternatives I can think of to get close to 0.5 ulps for all possible positive values of
a
that can fit in 32-bit floating point. None are likely to be acceptable.The first alternative involves pre-computing a lookup table (using "big real number" maths) for every value of
a
, which (with some tricks) ends up being about 2 GiB for 32-bit floating point (and completely insane for 64-bit floating point). Of course if the range of possible values ofa
is smaller than "any positive value that can fit in a 32-bit float" the size of the lookup table would be reduced.The second alternative is to use something else ("big real number") for the calculation at run-time (and convert to/from 32-bit floating point).
The third alternative involves, "something" (I don't know what it's called, but it's expensive). Set the rounding mode to "round to positive infinity" and calculate
temp1 = (a + K); if(a < K) temp2 = (a - K);
then switch to "round to negative infinity" and calculateif(a >= K) temp2 = (a - K); lower_bound = temp2 / temp1;
. Next doa_lower = a
and decreasea_lower
by the smallest amount possible and repeat the "lower_bound" calculation, and keep doing that until you get a different value forlower_bound
, then revert back to the previous value ofa_lower
. After that you do essentially the same (but opposite rounding modes, and incrementing not decrementing) to determineupper_bound
anda_upper
(starting with the original value ofa
). Finally, interpolate, likea_range = a_upper - a_lower; result = upper_bound * (a_upper - a) / a_range + lower_bound * (a - a_lower) / a_range;
. Note that you will want to calculate an initial upper and lower bound and skip all of this if they're equal. Also be warned that this is all "in theory, completely untested" and I probably borked it somewhere.Mainly what I'm saying is that (in my opinion) you should give up and accept that there's nothing that you can do to get close to 0.5 ulp. Sorry.. :)
Since my goal is to merely widen the interval on which accurate results are achieved, rather than to find a solution that works for all possible values of
a
, making use of double-float
arithmetic for all intermediate computation seems too costly.Thinking some more about the problem, it is clear that the computation of the remainder of the division,
e
in the code from my question, is the crucial part of achieving more accurate result. Mathematically, the remainder is (a-K) - q * (a+K). In my code, I simply usedm
to represent (a-K) and represented (a+k) asm + 2*K
, as this delivers numerically superior results to the straightforward representation.With relatively small additional computational cost, (a+K) can be represented as a double-
float
, that is, a head-tail pairp:plo
, which leads to the following modified version of my original code:Testing shows that this delivers nearly correctly rounded results for
a
in [K/2, 224*K), allowing for a substantial increase to the upper bound of the interval on which accurate results are achieved.Widening the interval at the lower end requires the more accurate representation of (a-K). We can compute this as a double-
float
head-tail pairm:mlo
, which leads to the following code variant:Exhaustive testing hows that this delivers nearly correctly rounded results for
a
in the interval [K/224, K*224). Unfortunately, this comes at a cost of ten additional operations compared to the code in my question, which is a steep price to pay to get the maximum error from around 1.625 ulps with the naive computation down to near 0.5 ulp.As in my original code from the question, one can express (a+K) in terms of (a-K), thus eliminating the computation of the tail of
p
,plo
. This approach results in the following code:This turns out to be advantageous if the main focus is decreasing the lower limit of the interval, which is my particular focus as explained in the question. Exhaustive testing of the single-precision case shows that when K=2n nearly correctly rounded results are produced for values of
a
in the interval [K/224, 4.23*K]. With a total of 14 or 15 operations (depending on whether an architecture supports full predication or just conditional moves), this requires seven to eight more operations than my original code.Lastly, one might base the residual computation directly on the original variable
a
to avoid the error inherent in the computation ofm
andp
. This leads to the following code that, for K = 2n, computes nearly correctly rounded results fora
in the interval [K/224, K/3):