Why do certain floating point calculations turn th

Posted 2019-08-21 08:20

Question:

I'm trying to get a better understanding of floating-point arithmetic: the errors that occur and accumulate, and why exactly the results turn out the way they do. Here are the examples I'm currently working on:

1.) 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 - 1.0 = -1.1102230246251565E-16, i.e. adding 0.1 ten times gives me a number slightly less than 1.0. However, 0.1 is represented (as a double) as slightly larger than 0.1. Also, 0.1*3 is slightly larger than 0.3, but 0.1*8 is slightly smaller than 0.8.

2.) 123456789f+1 = 123456792 and 123456789f +4 = 123456800.

What's up with those results? It's all still a bit mysterious to me.

Answer 1:

Typical modern processors and programming languages use IEEE-754 arithmetic (more or less), with 32-bit binary floating-point for float and 64-bit binary floating-point for double. In double, a 53-bit significand is used. This means that, when a decimal numeral is converted to double, it is converted to some number s•f•2^e, where s is a sign (+1 or −1), f is an unsigned integer that can be represented in 53 bits, and e is an integer between −1074 and 971, inclusive. (Or, if the number being converted is too large, the result can be +infinity or −infinity.) (Those who know the floating-point format may complain that the exponent is properly between −1022 and 1023, but I have shifted the significand to make it an integer. I am describing the mathematical value, not the encoding.)

Converting .1 to double yields 3602879701896397 / 36028797018963968, because, of all the numbers in the required form, that one is closest to .1. The denominator is 2^55, so e is −55.
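As a quick check of this decomposition, here is a minimal sketch in Python (my choice of language, not the asker's; it assumes Python's float is an IEEE-754 double, as it is on typical platforms). The helper name decompose is hypothetical. It picks the smallest possible f for a given value, which for .1 coincides with the numerator quoted above.

```python
from fractions import Fraction

def decompose(x: float):
    """Return (s, f, e) such that x == s * f * 2**e exactly, with f a non-negative integer.

    Hypothetical helper for illustration; assumes x is finite.
    """
    num, den = x.as_integer_ratio()  # exact ratio in lowest terms; den is a power of two
    s = -1 if num < 0 else 1
    f = abs(num)
    e = -(den.bit_length() - 1)      # den == 2**(-e)
    # Strip factors of two from f so that it is as small as possible.
    while f != 0 and f % 2 == 0:
        f //= 2
        e += 1
    return s, f, e

s, f, e = decompose(0.1)
print(s, f, e)                          # 1 3602879701896397 -55
print(f < 2**53)                        # True: f fits in 53 bits
print(Fraction(0.1) > Fraction(1, 10))  # True: the stored value is slightly above 1/10
```

Fraction(0.1) converts the double to an exact rational, so the last comparison is done without any further rounding.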

When we add two of these, we get 7205759403792794 / 36028797018963968. That is fine: the numerator is still less than 2^53, so it fits in the format.

When we add a third 3602879701896397 / 36028797018963968, the mathematical result is 10808639105689191 / 36028797018963968. Unfortunately, the numerator is too large; it is larger than 2^53 (9007199254740992). So the floating-point hardware cannot return that number. It has to make it fit somehow.

If we divide the numerator and the denominator by two, we have 5404319552844595.5 / 18014398509481984. This has the same value, but the numerator is not an integer. To make it fit, the hardware rounds it to an integer. When the fraction is exactly 1/2, the rule is to round to make the result even, so the hardware returns 5404319552844596 / 18014398509481984.
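The rounding step just described can be modelled with exact rational arithmetic. The following is only a sketch under stated assumptions: round_to_double is a hypothetical name, it handles positive values only, and it ignores subnormals and overflow. It scales the value until the numerator lies in [2^52, 2^53) and then rounds to the nearest integer, with ties going to even.

```python
from fractions import Fraction

def round_to_double(q: Fraction) -> Fraction:
    """Round a positive rational to the nearest f * 2**e with f < 2**53, ties to even.

    Illustrative model only: no zero, negatives, subnormals or overflow.
    """
    if q <= 0:
        raise ValueError("this sketch only handles positive values")
    e = 0
    while q / Fraction(2) ** e >= 2**53:  # numerator too large: use a bigger exponent
        e += 1
    while q / Fraction(2) ** e < 2**52:   # numerator too small: use a smaller exponent
        e -= 1
    f = round(q / Fraction(2) ** e)       # round() on a Fraction rounds ties to even
    return f * Fraction(2) ** e

tenth = Fraction(0.1)                     # exact value of the double nearest .1
third_sum = round_to_double(3 * tenth)    # the exact sum has numerator ...595.5 over 2**54
print(third_sum == Fraction(5404319552844596, 18014398509481984))  # True
print(third_sum == Fraction(0.1 + 0.1 + 0.1))                      # True
```

The last line compares the model against what the machine's own addition returns.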

Next, we take the current sum, 5404319552844596 / 18014398509481984, and add 3602879701896397 / 36028797018963968 again. This time, the sum is 7205759403792794.5 / 18014398509481984. The fraction is again exactly 1/2, and this time rounding to even means rounding down, so the hardware returns 7205759403792794 / 18014398509481984.

Then we add 7205759403792794 / 18014398509481984 and 3602879701896397 / 36028797018963968, and the sum is 9007199254740992.5 / 18014398509481984. Note that the numerator not only has a fraction but is larger than 2^53. So we have to reduce it again, which produces 4503599627370496.25 / 9007199254740992. Rounding the numerator to an integer produces 4503599627370496 / 9007199254740992.

That is exactly 1/2. At this point, the rounding errors have coincidentally canceled; adding .1 five times yields exactly .5.

When we add 4503599627370496 / 9007199254740992 and 3602879701896397 / 36028797018963968, the result is exactly 5404319552844595.25 / 9007199254740992. The hardware rounds down and returns 5404319552844595 / 9007199254740992.

Now you can see we are going to round down repeatedly. To add 3602879701896397 / 36028797018963968 to the accumulating sum, the hardware has to divide its numerator by four to make it match. That means the fraction part is always going to be .25, and it will be rounded down. So the next four sums are also rounded down. We end up with 9007199254740991 / 9007199254740992, which is just less than 1.
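To watch the whole sequence, one can compare each machine sum with the exact rational sum of its two operands. This is a small sketch, again assuming Python floats are IEEE-754 doubles with round-to-nearest-even; trace_sum is a hypothetical helper name.

```python
from fractions import Fraction

def trace_sum(term: float = 0.1, count: int = 10):
    """Add `term` to a running total `count` times and record how each sum was rounded."""
    total = 0.0
    directions = []
    for _ in range(count):
        exact = Fraction(total) + Fraction(term)  # exact rational sum of the two doubles
        total += term                              # what the hardware returns
        got = Fraction(total)
        directions.append("exact" if got == exact else "up" if got > exact else "down")
    return total, directions

total, directions = trace_sum()
print(directions)
# ['exact', 'exact', 'up', 'down', 'down', 'down', 'down', 'down', 'down', 'down']
print(Fraction(total) == Fraction(9007199254740991, 9007199254740992))  # True
print(total - 1.0)                                                      # -1.1102230246251565e-16
```

The first two sums are exact, the third rounds up, and every later sum rounds down, which matches the walk-through above.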

With float instead of double, the numerator has to fit in 24 bits, so it has to be less than 2^24 (16777216). So 123456789 is too big even before any arithmetic is done. It has to be expressed as 15432099 • 2^3, which is 123456792. The exact mathematical result of adding 1 is 15432099.125 • 2^3, and rounding that significand to an integer yields 15432099 • 2^3, so there is no change. But, if you add four, the result is 15432099.5 • 2^3, and that rounds to 15432100 • 2^3.
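Python has no built-in 32-bit float type, but the float case can be approximated by round-tripping a double through the struct module's 4-byte format. This is a sketch, not how the asker's language does it: to_float32 is a hypothetical helper, and it assumes the platform rounds the double-to-float conversion to nearest, ties to even. The sums below are exact in double, so each result is rounded only once, as it would be in real float arithmetic.

```python
import struct

def to_float32(x: float) -> float:
    """Round a double to the nearest IEEE-754 binary32 value, returned as a double."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

f = to_float32(123456789.0)
print(f)                    # 123456792.0, i.e. 15432099 * 2**3
print(int(f) // 8)          # 15432099, the 24-bit significand
print(to_float32(f + 1.0))  # 123456792.0: 15432099.125 rounds down to 15432099
print(to_float32(f + 4.0))  # 123456800.0: 15432099.5 is a tie and rounds to even, 15432100
```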