For IEEE-754 arithmetic, is there a guarantee of 0 or 1 units in the last place accuracy for reciprocals? From that, is there a guaranteed error-bound on the reciprocal of a reciprocal?
[Everything below assumes a fixed IEEE 754 binary format, with some form of round-to-nearest as the rounding-mode.]
Since the reciprocal (computed as `1/x`) is a basic arithmetic operation, `1` is exactly representable, and the basic arithmetic operations are guaranteed correctly rounded by the standard, the reciprocal result is guaranteed to be within 0.5 units in the last place of the true value. (This applies to any of the basic arithmetic operations specified in the standard.)

The reciprocal of the reciprocal of a value `x` is not guaranteed to be equal to `x`, in general. A quick example with the IEEE 754 binary64 format:
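(A sketch: the loop below searches for a failing value at random rather than hard-coding one, so the exact number printed varies with the seed.)

```python
import random

# Hunt for a binary64 value whose reciprocal fails to roundtrip.
random.seed(0)
while True:
    x = random.uniform(1.0, 2.0)
    if 1.0 / (1.0 / x) != x:
        # x and 1.0 / (1.0 / x) differ in the last place.
        print(x, 1.0 / (1.0 / x))
        break
```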
However, assuming that overflow and underflow are avoided, and that `x` is finite, nonzero, and exactly representable, the following results are true:

1. The value of `1.0 / (1.0 / x)` will differ from `x` by no more than 1 unit in the last place.
2. If the significand of `x` (normalised to lie in the range `[1.0, 2.0)` as usual) is smaller than `sqrt(2)`, then the reciprocal does roundtrip: `1.0 / (1.0 / x) == x`.
Sketch of proof: without loss of generality we can assume that `x` is positive, and scale `x` by a power of two so that it lies in the range `[1.0, 2.0)`. The results above are clearly true in the case that `x` is an exact power of two, so let's assume it's not (this will be useful in the second step below). The proof below is specific to the precision of the IEEE 754 binary64 format, but it adapts directly to any IEEE 754 binary format.

Write `1 / x` for the true value of the reciprocal, before rounding, and let `y` be the (unique, as it turns out) nearest representable binary64 float to `1 / x`. Then:
- since `y` is the closest float to `1 / x`, and both `y` and `1 / x` are in the binade `[0.5, 1.0]`, where the spacing between successive floats is exactly `2^-53`, we have `|y - 1/x| <= 2^-54`. In fact, we can do a bit better:
- we actually have a strict inequality above: `|y - 1/x| < 2^-54`. If `|y - 1/x|` were exactly equal to `2^-54`, then `1/x` would be exactly representable in arbitrary-precision binary floating-point (since both `y` and `2^-54` are). But the only binary floats `x` for which `1/x` is exactly representable at some precision are powers of two, and we already excluded this case.
- if `x < sqrt(2)` then `1 / x > x / 2`, hence (rounding both to the nearest representable float) we have `y >= x / 2`, so `x / y <= 2`.
- now `x - 1/y = (y - 1/x) * (x/y)`, and from the bounds on `|y - 1/x|` and `x/y` (still assuming that `x < sqrt(2)`) we get `|x - 1/y| < 2^-53`, which is half the spacing `2^-52` between successive floats in `[1.0, 2.0)`. It follows that `x` is the closest representable float to `1/y`, so `1/y` rounds to `x` and the roundtrip succeeds. This completes the proof of part 2.
- in the general case `x < 2`, we have `x / y < 4`, so `|x - 1/y| < 2^-52`. That makes `1/y` at most 1 ulp away from `x`, which completes the proof of part 1.

Here's a demonstration of the `sqrt(2)` threshold: using Python, we take a million random floats in the range `[1.0, 2.0)`, and identify those that don't roundtrip through the reciprocal. All of the samples smaller than `sqrt(2)` pass the roundtrip.
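(A sketch of that experiment; the failure count varies with the seed.)

```python
import math
import random

# Take a million random binary64 floats in [1.0, 2.0) and collect
# the ones that fail to roundtrip through the reciprocal.
random.seed(0)
xs = [random.uniform(1.0, 2.0) for _ in range(10**6)]
failures = [x for x in xs if 1.0 / (1.0 / x) != x]

# Per part 2 of the proof, no failure lies below sqrt(2).
print(len(failures))
print(min(failures), math.sqrt(2))
```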
And a demonstration that the maximum error is no more than 1 ulp, in general (for the binary64 format, in the binade `[1.0, 2.0)`, 1 unit in the last place is `2^-52`):
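(Again a sketch along the same lines; the subtraction and the division by a power of two below are both exact, so the measurement itself introduces no extra rounding.)

```python
import random

# Largest observed roundtrip error, measured in units of 2**-52
# (one ulp for floats in [1.0, 2.0)).
random.seed(0)
max_error_ulps = max(
    abs(x - 1.0 / (1.0 / x)) / 2.0**-52
    for x in (random.uniform(1.0, 2.0) for _ in range(10**6))
)
print(max_error_ulps)  # part 1 of the proof says this never exceeds 1.0
```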
Here's an example with the IEEE 754 binary64 format showing that the restriction about avoiding underflow is necessary:
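(A sketch: rather than hard-coding a value, the loop below searches for an `x` whose reciprocal falls in the subnormal range and whose roundtrip lands exactly two ulps away; the particular value found depends on the seed.)

```python
import math
import random

# For x in (2**1022, 2**1023), the reciprocal 1/x lies below the
# smallest positive normal float 2**-1022, so it is rounded to a
# subnormal and stored with reduced precision.
random.seed(0)
while True:
    x = random.uniform(2.0**1022, 2.0**1023)
    x_roundtrip = 1.0 / (1.0 / x)
    if abs(x_roundtrip - x) == 2.0 * math.ulp(x):
        print(x.hex())
        print(x_roundtrip.hex())
        break
```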
Here the `x_roundtrip` result differs from the original by two units in the last place, because `1 / x` was smaller than the smallest normal representable float, and so was not represented to the same precision as `x`.

Final note: since IEEE 754-2008 covers decimal floating-point types as well, I should mention that the above proof carries over almost verbatim to the decimal case, establishing that for floats with significand smaller than `sqrt(10)`, roundtrip occurs, while for general decimal floats (again avoiding overflow and underflow) we're never off by more than one unit in the last place. However, there's some number-theoretic finesse required to show that the key inequality `|x - 1/y| < (1/2) * 10^(1-p)` is always strict: one ends up having to show that the quantity `1 + 16 * 10^(2p-1)` is never a square number (which is true, but it's probably outside the ambit of this site to include the proof here).
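(As a final sketch, Python's `decimal` module can stand in for a fixed-precision IEEE 754 decimal format here: fixed working precision, round-half-even, and no overflow or underflow in the sampled range.)

```python
import random
from decimal import Decimal, getcontext

# Model a 16-digit decimal format: fixed precision, round-half-even.
getcontext().prec = 16
random.seed(0)

one = Decimal(1)
failures = []
for _ in range(10**5):
    # Random 16-digit significand, placed in [1, 10).
    x = Decimal(random.randrange(10**15, 10**16)).scaleb(-15)
    if one / (one / x) != x:
        failures.append(x)

# Decimal analogue of part 2: no failure below sqrt(10) ~ 3.16227766...
print(len(failures))
print(min(failures))
```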