For IEEE-754 arithmetic, is there a guarantee of 0 or 1 units in the last place accuracy for reciprocals? From that, is there a guaranteed error-bound on the reciprocal of a reciprocal?
问题:
回答1:
[Everything below assumes a fixed IEEE 754 binary format, with some form of round-to-nearest as the rounding-mode.]
Since reciprocal (computed as 1/x
) is a basic arithmetic operation, 1
is exactly representable, and the arithmetic operations are guaranteed correctly rounded by the standard, the reciprocal result is guaranteed to be within 0.5
units in the last place of the true value. (This applies to any of the basic arithmetic operations specified in the standard.)
The reciprocal of the reciprocal of a value x
is not guaranteed to be equal to x
, in general. A quick example with the IEEE 754 binary64 format:
>>> x = 1.8
>>> 1.0 / (1.0 / x)
1.7999999999999998
>>> x == 1.0 / (1.0 / x)
False
However, assuming that overflow and underflow are avoided, and that x
is finite, nonzero, and exactly representable, the following results are true:
The value of
1.0 / (1.0 / x)
will differ fromx
by no more than 1 unit in the last place.If the significand of
x
(normalised to lie in the range[1.0, 2.0)
as usual) is smaller thansqrt(2)
, then the reciprocal does roundtrip:1.0 / (1.0 / x) == x
.
Sketch of proof: without loss of generality we can assume that x
is positive, and scale x
by a power of two so that it lies in the range [1.0, 2.0)
. The results above are clearly true in the case that x
is an exact power of two, so let's assume it's not (this will be useful in the second step below). The proof below is given for precision specific to the IEEE 754 binary64 format, but it adapts directly to any IEEE 754 binary format.
Write 1 / x
for the true value of the reciprocal, before rounding, and let y
be the (unique, as it turns out) nearest representable binary64 float to 1 / x
. Then:
since
y
is the closest float to1 / x
, and bothy
and1/x
are in the binade[0.5, 1.0]
, where the spacing between successive floats is exactly2^-53
, we have|y - 1/x| <= 2^-54
. In fact, we can do a bit better:we actually have a strict inequality above:
|y - 1/x| < 2^-54
. If|y - 1/x|
were exactly equal to2^-54
, then1/x
would be exactly representable in arbitrary-precision binary floating-point (since bothy
and2^-54
are). But the only binary floatsx
for which1/x
is exactly representable at some precision are powers of two, and we already excluded this case.if
x < sqrt(2)
then1 / x > x / 2
, hence (rounding both to the nearest representable float), we havey >= x / 2
, sox / y <= 2
.now
x - 1/y = (y - 1/x) x/y
, and from the bounds on|y - 1/x|
andx/y
(still assuming thatx < sqrt(2)
) we get|x - 1/y| < 2^-53
. It follows thatx
is the closest representable float to1/y
,1/y
rounds tox
and the roundtrip succeeds. This completes the proof of part 2.in the general case
x < 2
, we havex / y < 4
, so|x - 1/y| < 2^-52
. That makes1/y
at most 1 ulp away fromx
, which completes the proof of part 1.
Here's a demonstration of the sqrt(2)
threshold: using Python, we take a million random floats in the range [1.0, 2.0)
, and identify those that don't roundtrip through the reciprocal. All of the samples smaller than sqrt(2)
pass the roundtrip.
>>> import random
>>> samples = [random.uniform(1.0, 2.0) for _ in range(10**6)]
>>> bad = [x for x in samples if 1.0 / (1.0 / x) != x]
>>> len(bad)
171279
>>> min(bad)
1.4150519879892107
>>> import math
>>> math.sqrt(2)
1.4142135623730951
And a demonstration that the maximum error is no more than 1 ulp, in general (for the binary64 format, in the binade [1.0, 2.0), 1 unit in the last place is 2^-52):
>>> samples = [random.uniform(1.0, 2.0) for _ in range(10**6)]
>>> max(abs(x - 1.0 / (1.0 / x)) for x in samples)
2.220446049250313e-16
>>> 2**-52
2.220446049250313e-16
Here's an example with the IEEE 754 binary64 format showing that the restriction about avoiding underflow is necessary:
>>> x = 1.3e308
>>> x_roundtrip = 1.0 / (1.0 / x)
>>> x.hex()
'0x1.72409614c1e6ap+1023'
>>> x_roundtrip.hex()
'0x1.72409614c1e6cp+1023'
Here the x_roundtrip
result differs from the original by two units in the last place, because 1 / x
was smaller than the smallest normal representable float, and so was not represented to the same precision as x
.
Final note: since IEEE 754-2008 covers decimal floating-point types as well, I should mention that the above proof carries over almost verbatim to the decimal case, establishing that for floats with significand smaller than sqrt(10)
, roundtrip occurs, while for general decimal floats (again avoiding overflow and underflow) we're never off by more than one unit in the last place. However, there's some number-theoretic finesse required to show that the key inequality |x - 1/y| < 1/2 10^(1-p)
is always strict: one ends up having to show that the quantity 1 + 16 10^(2p-1)
is never a square number (which is true, but it's probably outside the ambit of this site to include the proof here).
>>> from decimal import Decimal, getcontext
>>> import random
>>> getcontext().prec = 6
>>> samples = [+Decimal(random.uniform(1.0, 10.0)) for _ in range(10**6)]
>>> bad = [x for x in samples if 1 / (1 / x) != x]
>>> min(bad)
Decimal('3.16782')