We know that floating point is broken, because decimal numbers can't always be perfectly represented in binary. They're rounded to a number that can be represented in binary; sometimes that number is higher, and sometimes it's lower. In this case using the ubiquitous IEEE 754 double format both 0.1 and 0.4 round higher:
0.1 = 0.1000000000000000055511151231257827021181583404541015625
0.4 = 0.40000000000000002220446049250313080847263336181640625
Since both of these numbers are high, you'd expect their sum to be high as well. Perfect addition should give you 0.5000000000000000277555756156289135105907917022705078125
, but instead you get a nice exact 0.5
. Why?
The question Is floating point math broken? was already identified above, but this question is different. It is asking for a further level of detail on a non-intuitive result when taking the answers of that question into consideration.
This calculation behaves this way because the addition pushes the result into another (binary) order of magnitude. This adds a significant bit to the left (most-significant side) and therefore has to drop a bit on the right (least-significant side).
The first number, 0.1
, is stored in binary as a number between 2^-4 == 1/16
and 2^-3 == 1/8
. The second number, 0.4
, is stored in binary as a number between 2^-2 == 1/4
and 2^-1 == 1/2
. The sum, 0.5
, is the number 2^-1 == 1/2
or a little larger. This is a mis-match in magnitudes and can cause loss of digits.
Here is an example, easier to understand. Let's say we are working on a decimal computer that can store four decimal digits in floating point. Let's also say we want to add the numbers 10/3
and 20/3
. These may end up stored as
3.334
and
6.667
both of which are a little high. When we get those numbers, we expect the sum to be also a little high, namely
10.001
but notice that we have moved into a new order of magnitude with our result. The full result has five decimal digits, which will not fit. So the computer rounds the result to just four decimal digits and gets the sum
10.00
which surprisingly is the correct exact answer to 10/3 + 20/3
.
I get the same kind of thing often in my U.S. high-school Chemistry and Physics classes. When a calculation moves to a new order of magnitude, strange things happen with precision and significant digits.
Although most decimal numbers need to be rounded to fit into binary, some don't. 0.5
can be exactly represented in binary, since it's 2-1.
Floating point isn't just binary, it also has limited precision. Here is the exact sum and the two closest IEEE 754 double representable numbers on either side:
0.5000000000000000277555756156289135105907917022705078125
0.5000000000000000000000000000000000000000000000000000000
0.5000000000000001110223024625156540423631668090820312500
It's clear that the exact 0.5 is closest to the true sum. IEEE 754 has rules regarding simple math operations that dictate how result rounding will take place, and you can generally rely on the closest result to be taken.