How does floating point error propagate when doing

Posted 2019-02-17 19:30

Question:

Let's say that we have declared the following variables

float a = 1.2291;

float b = 3.99;

float variables have precision 6, which (if I understand correctly) means that the difference between the number that the computer actually stores and the actual number that you want will be less than 10^-6

that means that both a and b have some error that is less than 10^-6

so inside the computer a could actually be 1.229100000012123 and b could be 3.9900000191919

now let's say that you have the following code

float c = 0;
for(int i = 0; i < 1000; i++)
      c += a + b;

my question is,

will c's final result have a precision error that is less than 10^-6 as well or not?

and if the answer is no, how can we actually determine this precision error, and what exactly happens if you apply any kind of operation, as many times as you wish and in any order?

Answer 1:

float variables have precision 6, which (if I understand correctly) means that the difference between the number that the computer actually stores and the actual number that you want will be less than 10^-6

that means that both a and b have some error that is less than 10^-6

The 10^-6 figure is a rough measure of the relative accuracy when representing arbitrary constants as floats. Not all numbers will be represented with an absolute error of 10^-6. The number 8765432.1, for instance, can be expected to be represented approximately to the unit. If you are at least a little bit lucky, you will get 8765432 when representing it as a float. On the other hand, 1E-15f can be expected to be represented with an absolute error of at most about 10^-21.
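The relative-versus-absolute point can be checked by measuring the gap between a float and the next representable one. The following is a minimal sketch, assuming float is a 32-bit IEEE 754 type; next_float_up and float_spacing are hypothetical helper names, not library functions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Next representable float above x.
   Assumption: float is IEEE 754 binary32 and x is positive and finite. */
float next_float_up(float x)
{
    uint32_t bits;
    assert(sizeof bits == sizeof x && x > 0.0f);
    memcpy(&bits, &x, sizeof bits);   /* reinterpret the float's bit pattern */
    bits += 1;                        /* adjacent bit pattern = adjacent float */
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* Gap between x and its upper neighbour: the local absolute resolution. */
float float_spacing(float x)
{
    return next_float_up(x) - x;
}
```

Under those assumptions, float_spacing(8765432.1f) is 1, while float_spacing(1e-15f) is on the order of 10^-22, so the absolute error scales with the magnitude of the number.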

so inside the computer a could actually be 1.229100000012123 and b could be 3.9900000191919

No, sorry, the way it works is not that you write the entire number and add six zeroes for the possible error. The error can be estimated by counting six zeroes from the leading digit, not from the last digit. Here, you could expect 1.22910012123 or 3.990000191919.

(Actually you would get exactly 1.2290999889373779296875 and 3.9900000095367431640625. Don't forget that representation error can be negative as well as positive, as it is for the first number.)
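To see the stored values for yourself, you can convert the constant to float and print it back with enough digits. This is only a sketch: representation_error and print_stored are hypothetical helpers, and the reference value is itself a double, so the error is measured to double precision rather than exactly.

```c
#include <assert.h>
#include <stdio.h>

/* Signed representation error: (float nearest to x) minus x, computed in double.
   x is a double, so the result is only as exact as double allows. */
double representation_error(double x)
{
    assert(x == x);                   /* reject NaN */
    float stored = (float)x;          /* round to the nearest float */
    return (double)stored - x;
}

/* Hypothetical helper: print what a float actually stores for x. */
void print_stored(double x)
{
    float stored = (float)x;
    printf("%.4f is stored as %.25f (error %+.3e)\n",
           x, (double)stored, representation_error(x));
}
```

Calling print_stored(1.2291) and print_stored(3.99) on an IEEE 754 platform should report a negative error for the first value and a positive one for the second.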

now let's say that you have the following code […]

my question is,

will c's final result have a precision error that is less than 10^-6 as well or not?

No. The total absolute error will be the sum of all the representation errors for a and b for each of the thousand times you used them, plus the errors of the 2000 additions you did. That's 4000 different sources of error! Many of them will be identical, and some of them will happen to compensate each other, but the end result will probably not be accurate to 10^-6 in relative terms; more like 10^-5 (a rough guess, not a computed bound).
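Rather than estimating the accumulated error, you can measure it by running the loop in float and comparing against the same loop in double. The sketch below assumes IEEE 754 arithmetic without relaxed floating-point compiler options; the function names are made up for illustration, and the double result is a higher-precision reference, not an exact one.

```c
#include <assert.h>

/* The loop from the question, accumulated in float. */
float sum_in_float(float a, float b, int n)
{
    float c = 0.0f;
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        c += a + b;                   /* each addition may round */
    return c;
}

/* Same loop in double, used as the reference. */
double sum_in_double(double a, double b, int n)
{
    double c = 0.0;
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        c += a + b;
    return c;
}

/* Absolute error of the float loop, measured against the double loop. */
double accumulated_error(int n)
{
    return (double)sum_in_float(1.2291f, 3.99f, n)
         - sum_in_double(1.2291, 3.99, n);
}
```

accumulated_error(1000) should come out well above 10^-6 in absolute terms: the running sum grows to about 5219, where adjacent floats are roughly 5*10^-4 apart, so every addition near the end can lose that much.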



Answer 2:

This is a very good question, one that has been addressed for decades by many authorities and that is a computer science discipline in itself. From What Every Computer Scientist Should Know About Floating-Point Arithmetic:

Floating-point arithmetic is considered an esoteric subject by many people. This is rather surprising because floating-point is ubiquitous in computer systems. Almost every language has a floating-point datatype; computers from PCs to supercomputers have floating-point accelerators; most compilers will be called upon to compile floating-point algorithms from time to time; and virtually every operating system must respond to floating-point exceptions such as overflow. This paper presents a tutorial on those aspects of floating-point that have a direct impact on designers of computer systems. It begins with background on floating-point representation and rounding error, continues with a discussion of the IEEE floating-point standard, and concludes with numerous examples of how computer builders can better support floating-point.

(Emphasis mine)



Answer 3:

The short answer is that you cannot easily determine the precision of a long chain of floating point operations.

The precision of an operation like "c += a + b" depends not only on the raw precision of the floating point implementation (which these days is almost always IEEE 754), but also on the actual values of a, b, and c.

Furthermore, the compiler may choose to optimize the code in different ways, which can result in unexpected issues. It might transform the statement into "c += a; c += b;", or replace the whole loop with "tmp = a*1000; tmp += b*1000; c += tmp;", or pick some other variant that it judges to run faster while being mathematically equivalent, even though the rounded floating-point result can differ.
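The effect of such rewrites can be illustrated by writing the three variants by hand, so the comparison does not depend on what a particular compiler does. This is a sketch with hypothetical function names, assuming IEEE 754 float arithmetic; whether a compiler may apply these transformations on its own depends on its floating-point settings.

```c
#include <assert.h>

/* Variant 1: the loop as written in the question. */
float sum_grouped(float a, float b, int n)
{
    float c = 0.0f;
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        c += a + b;
    return c;
}

/* Variant 2: the two additions applied separately. */
float sum_split(float a, float b, int n)
{
    float c = 0.0f;
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        c += a;
        c += b;
    }
    return c;
}

/* Variant 3: closed form, one multiplication per operand. */
float sum_scaled(float a, float b, int n)
{
    assert(n >= 0);
    return a * (float)n + b * (float)n;
}
```

All three compute the same mathematical quantity, but for a = 1.2291f, b = 3.99f, n = 1000 the looped and closed-form variants are expected to return different floats, because the closed form rounds only a handful of times while the loop rounds on every iteration.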

Bottom line is that analysis of precision is not possible by inspecting source code alone.

For that reason, many people simply use a higher-precision type such as double or long double and then check that the precision issues are gone for all practical purposes.

If that does not suffice, then a fallback is always to implement all logic in integers and avoid floats.
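As an illustration of the integer fallback, the values from the question can be stored as scaled integers. The sketch below assumes four decimal places are enough, so everything is kept as a count of 1/10000; SCALE and fixed_point_sum are hypothetical names chosen for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for this sketch: values carry at most four decimal places,
   so they are stored as integer counts of 1/10000. */
#define SCALE 10000

/* Sum (a + b) n times using exact integer arithmetic on scaled values. */
int64_t fixed_point_sum(int64_t a_scaled, int64_t b_scaled, int n)
{
    int64_t c = 0;
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        c += a_scaled + b_scaled;     /* exact as long as nothing overflows */
    return c;
}
```

With 12291 and 39900 (1.2291 and 3.99 scaled by 10000), a thousand iterations give exactly 52191000, that is 5219.1000. The price is that you have to pick the scale up front and watch for overflow yourself.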