double precision error when converting to scientif

Posted 2019-07-24 12:12

Question:

I'm building a program to convert double values into scientific notation (mantissa, exponent). Then I noticed the following:

369.7900000000000 -> 3.6978999999999997428

68600000 -> 6.8599999999999994316

I noticed the same pattern for several other values as well. The maximum fractional error is

0.000 000 000 000 001 = 1e-15

I know that double values cannot all be represented exactly in a computer. Can I conclude that the maximum fractional error I would get is 1e-15? What is significant about this value?

I went through most of the questions on floating-point precision problems on Stack Overflow, but I didn't see any about the maximum fractional error in 64 bits.

To be clear about the computation I do, here is my code snippet:

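double norm = 68600000;
int exp = 0;                /* decimal exponent (not declared in the original snippet) */
if (norm != 0.0)            /* zero has no normalized form; skipping it avoids an endless loop */
{
    /* Scale down until the mantissa is below 10. */
    while (norm >= 10.0)
    {
      norm /= 10.0;         /* every division rounds, so errors can accumulate */
      exp++;
    }
    /* Scale up until the mantissa is at least 1 (assumes norm is positive). */
    while (norm < 1.0)
    {
      norm *= 10.0;
      exp--;
    }
}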

Now I get

norm = 6.8599999999999994316;
exp = 7

Answer 1:

The number you are getting is related to the machine epsilon for the double data type.

A double is 64 bits long, with 1 bit for the sign, 11 bits for the exponent, and 52 bits for the mantissa fraction. A double's value is given by

1.mmmmm... * (2^exp)

With only 52 bits for the mantissa fraction, the gap between 1.0 and the next larger double is 2^-52. A value that is small enough relative to 1.0 is lost completely when added to it. In binary, 1.0 + 2^-52 would be

1.000...00 + 0.000...01 = 1.000...01

Under the default round-to-nearest mode, adding 2^-53 or anything smaller does not change the value of 1.0, while values between 2^-53 and 2^-52 round up to 1.0 + 2^-52. You can verify for yourself that 1.0 + 2^-53 == 1.0 in a program.

This number 2^-52 ≈ 2.22e-16 is called the machine epsilon. It is an upper bound on the relative round-off error introduced by a single floating-point arithmetic operation on double values. (With round-to-nearest, the error of one operation is in fact at most half of that, 2^-53 ≈ 1.11e-16.)

Similarly, float has 23 bits in its mantissa and so its machine epsilon is 2^-23 = 1.19e-7.

The reason you are getting 1e-15 may be that errors accumulate as you perform many arithmetic operations, but I can't say for sure because I don't know the exact calculations you are doing.


EDIT: I've looked into the relative error for your problem with 68600000.

First off, you may be interested to know that round-off error can change the result of your computation if you break it into steps:

686.0/10.0      = 68.59999999999999431566
686.0/10.0/10.0 = 6.85999999999999943157
686.0/100.0     = 6.86000000000000031974

In the first line, the closest double to 68.6 is lower than the actual value, but in the third line we see the closest double to 6.86 is greater.

If we look at the absolute error e_abs = abs(v - v_approx) of your program, we see that it is

6.8600000 - 6.85999999999999943156581139192 ~= 5.684e-16

However, the relative error e_rel = abs((v - v_approx) / v) = e_abs / abs(v) would be

5.684e-16 / 6.86  ~=  8.286e-17

which is indeed below our machine epsilon of 2.22e-16.

This is a famous paper you can read if you want to know all the details about floating point arithmetic.