I keep getting mixed answers about whether floating point numbers (i.e. float, double, or long double) have one and only one value of precision, or have a precision value which can vary.
One topic called float vs. double precision seems to imply that floating point precision is an absolute.
However, another topic called Difference between float and double says,
In general a double has 15 to 16 decimal digits of precision
Another source says,
Variables of type float typically have a precision of about 7 significant digits
Variables of type double typically have a precision of about 16 significant digits
I don't like to refer to approximations like the above if I'm working with sensitive code that can break easily when my values are not exact. So let's set the record straight. Is floating point precision mutable or invariant, and why?
The type of a floating point variable defines what range of values and how many fractional bits (!) can be represented. As there is no integer relation between decimal and binary fractions, a decimal fraction can usually only be approximated in binary.
Second: Another problem is the precision with which arithmetic operations are performed. Just think of 1.0/3.0 or PI. Such values cannot be represented with a limited number of digits, neither decimal nor binary, so they have to be rounded to fit into the available space. The more fractional digits are available, the higher the precision. Now think of multiple such operations being applied, e.g. PI/3.0. This requires rounding twice: PI as such is not exact, and neither is the result. Precision is lost twice; repeated, it gets worse.
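As a quick illustration, here is a minimal C sketch of both roundings (my own example; the exact digits printed will vary by platform and compiler):

    #include <stdio.h>

    int main(void)
    {
        /* 1/3 has no finite representation in decimal or binary, so it is
           rounded to the nearest representable value of each type. */
        float  f_third = 1.0f / 3.0f;
        double d_third = 1.0 / 3.0;
        printf("float  1/3 = %.20f\n", f_third);   /* correct to ~7 digits  */
        printf("double 1/3 = %.20f\n", d_third);   /* correct to ~16 digits */

        /* PI is already rounded when written as a literal; dividing by 3
           rounds a second time, so the two errors combine. */
        double pi      = 3.14159265358979323846;   /* rounded to fit a double */
        double pi_by_3 = pi / 3.0;                 /* rounded again */
        printf("pi/3       = %.20f\n", pi_by_3);
        return 0;
    }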
So, back to float and double: float has, according to the standard (C11, Annex F, also for the rest), fewer bits available, so rounding will be less precise than for double. Just think of having a decimal type with 2 fractional digits (m.ff, call it float) and one with four (m.ffff, call it double). If double is used for all calculations, you can perform more operations before your result is down to only 2 correct fractional digits than if you already start with float, even if a float result would suffice. Note that on some (embedded) CPUs like the ARM Cortex-M4F, the hardware FPU only supports float (single precision), so double arithmetic will be much more costly. Other MCUs have no hardware floating point unit at all, so it has to be emulated in software (very costly). On most GPUs, float is also much cheaper to perform than double, sometimes by more than a factor of 10.
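To see the effect of the shorter mantissa in practice, here is a minimal sketch that repeatedly adds 0.1 in both types (the exact outputs depend on compiler and hardware, but the pattern is the same):

    #include <stdio.h>

    int main(void)
    {
        /* Add 0.1 ten thousand times; the exact answer is 1000.
           Every addition rounds, and float's shorter mantissa runs out of
           correct digits much sooner than double's. */
        float  fsum = 0.0f;
        double dsum = 0.0;
        for (int i = 0; i < 10000; ++i) {
            fsum += 0.1f;
            dsum += 0.1;
        }
        printf("float  sum = %.10f\n", fsum);  /* visibly off from 1000     */
        printf("double sum = %.10f\n", dsum);  /* off only far to the right */
        return 0;
    }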
80x86 code using its hardware coprocessor (originally the 8087) provides three levels of precision: 32-bit, 64-bit, and 80-bit. Those very closely follow the IEEE-754 standard of 1985. The more recent revision of the standard adds a 128-bit format. The floating point formats have 24, 53, 64, and 113 mantissa bits, which correspond to 7.22, 15.95, 19.27, and 34.02 decimal digits of precision.
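On a typical implementation you can query these figures from <float.h>; here is a small sketch (the long double line in particular will differ between platforms, since long double may be 64, 80, or 128 bits wide):

    #include <stdio.h>
    #include <float.h>
    #include <math.h>

    int main(void)
    {
        /* Mantissa bits of each type as reported by <float.h>, and the
           corresponding decimal digits: bits * log10(2). */
        printf("float:       %d mantissa bits ~ %.2f decimal digits (FLT_DIG=%d)\n",
               FLT_MANT_DIG, FLT_MANT_DIG * log10(2.0), FLT_DIG);
        printf("double:      %d mantissa bits ~ %.2f decimal digits (DBL_DIG=%d)\n",
               DBL_MANT_DIG, DBL_MANT_DIG * log10(2.0), DBL_DIG);
        printf("long double: %d mantissa bits ~ %.2f decimal digits (LDBL_DIG=%d)\n",
               LDBL_MANT_DIG, LDBL_MANT_DIG * log10(2.0), LDBL_DIG);
        return 0;
    }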
While the precision of any particular implementation does not vary, it may appear to when a floating point value is converted to decimal. Note that the value 0.1 does not have an exact binary representation. It is a repeating bit pattern (0.0001100110011001100110011001100...), just as we are used to 0.3333333333333 in decimal as an approximation of 1/3.

Many languages don't support the 80-bit format. Some C compilers may offer long double, which uses either 80-bit or 128-bit floats. Alas, it might also use a 64-bit float, depending on the implementation.

The FPU has 80-bit registers and performs all operations using the full 80-bit result. Code which calculates within the FPU stack benefits from this extra precision. Unfortunately, poor code generation, or poorly written code, might truncate or round intermediate calculations by storing them in a 32-bit or 64-bit variable.
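You can observe both points directly; a short sketch (what sizeof(long double) reports, and how many digits of 0.1 come out exact, will depend on your compiler and target):

    #include <stdio.h>

    int main(void)
    {
        /* 0.1 has no exact binary representation, so what is stored is the
           nearest representable double; printing enough digits reveals it. */
        double tenth = 0.1;
        printf("0.1 stored as double = %.25f\n", tenth);

        /* How wide long double really is depends on the implementation:
           a 64-bit, an 80-bit (often padded to 12 or 16 bytes), or a
           128-bit format are all possible. */
        printf("sizeof(long double)  = %zu bytes\n", sizeof(long double));
        printf("0.1 as long double   = %.25Lf\n", 0.1L);
        return 0;
    }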
The precision is fixed: exactly 53 binary digits for double precision (or 52 if we exclude the implicit leading 1). This comes out to about 15 decimal digits.
The OP asked me to elaborate on why having exactly 53 binary digits means "about" 15 decimal digits.
To understand this intuitively, let's consider a less-precise floating-point format: instead of a 52-bit mantissa like double-precision numbers have, we're just going to use a 4-bit mantissa.
So, each number will look like: (-1)^s × 2^yyy × 1.xxxx (where s is the sign bit, yyy is the exponent, and 1.xxxx is the normalised mantissa). For the immediate discussion, we'll focus only on the mantissa and not the sign or exponent.

Here's a table of what 1.xxxx looks like for all xxxx values, with each value also rounded to two decimal digits (all rounding is half-to-even, just like how the default floating-point rounding mode works):

    1.0000 = 1.0000 (1.00)    1.1000 = 1.5000 (1.50)
    1.0001 = 1.0625 (1.06)    1.1001 = 1.5625 (1.56)
    1.0010 = 1.1250 (1.12)    1.1010 = 1.6250 (1.62)
    1.0011 = 1.1875 (1.19)    1.1011 = 1.6875 (1.69)
    1.0100 = 1.2500 (1.25)    1.1100 = 1.7500 (1.75)
    1.0101 = 1.3125 (1.31)    1.1101 = 1.8125 (1.81)
    1.0110 = 1.3750 (1.38)    1.1110 = 1.8750 (1.88)
    1.0111 = 1.4375 (1.44)    1.1111 = 1.9375 (1.94)

How many decimal digits would you say that provides? You could say 2, in that each value in the two-decimal-digit range is covered, albeit not uniquely; or you could say 3, which covers all unique values but does not provide coverage for all values in the three-decimal-digit range.
For the sake of argument, we'll say it has 2 decimal digits: we'll define the decimal precision as the number of digits for which every value with that many decimal digits can be represented.
Okay, then, so what happens if we halve all the numbers (so we're using yyy = -1)?

    1.0000 × 2^-1 = 0.50000 (0.50)    1.1000 × 2^-1 = 0.75000 (0.75)
    1.0001 × 2^-1 = 0.53125 (0.53)    1.1001 × 2^-1 = 0.78125 (0.78)
    1.0010 × 2^-1 = 0.56250 (0.56)    1.1010 × 2^-1 = 0.81250 (0.81)
    1.0011 × 2^-1 = 0.59375 (0.59)    1.1011 × 2^-1 = 0.84375 (0.84)
    1.0100 × 2^-1 = 0.62500 (0.62)    1.1100 × 2^-1 = 0.87500 (0.88)
    1.0101 × 2^-1 = 0.65625 (0.66)    1.1101 × 2^-1 = 0.90625 (0.91)
    1.0110 × 2^-1 = 0.68750 (0.69)    1.1110 × 2^-1 = 0.93750 (0.94)
    1.0111 × 2^-1 = 0.71875 (0.72)    1.1111 × 2^-1 = 0.96875 (0.97)

By the same criteria as before, we're now dealing with 1 decimal digit. So you can see how, depending on the exponent, you can have more or fewer decimal digits, because binary and decimal floating-point numbers do not map cleanly onto each other.
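If you would rather not work the tables out by hand, a small sketch like this regenerates them by simply evaluating (1 + xxxx/16) × 2^exponent for every 4-bit pattern:

    #include <stdio.h>

    int main(void)
    {
        /* Enumerate every 4-bit mantissa pattern 1.xxxx for two exponents:
           2^0 (the first table) and 2^-1 (the halved table). */
        for (int exp = 0; exp >= -1; --exp) {
            printf("exponent %d:\n", exp);
            for (int bits = 0; bits < 16; ++bits) {
                double value = (1.0 + bits / 16.0) * (exp == 0 ? 1.0 : 0.5);
                printf("  1.%d%d%d%d -> %.5f\n",
                       (bits >> 3) & 1, (bits >> 2) & 1,
                       (bits >> 1) & 1, bits & 1, value);
            }
        }
        return 0;
    }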
The same argument applies to double-precision floating point numbers (with the 52-bit mantissa), only in that case you're getting either 15 or 16 decimal digits depending on the exponent.
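That same "either 15 or 16" fuzziness is why C distinguishes between DBL_DIG and (since C11) DBL_DECIMAL_DIG; a quick sketch of what they report:

    #include <stdio.h>
    #include <float.h>

    int main(void)
    {
        /* DBL_DIG (15) is how many decimal digits always survive a
           decimal -> double -> decimal round trip; DBL_DECIMAL_DIG (17,
           C11) is how many are needed to write a double out in decimal
           and read back exactly the same value. The usable "15 or 16"
           digits sit between those two bounds. */
        printf("DBL_MANT_DIG    = %d bits\n", DBL_MANT_DIG);
        printf("DBL_DIG         = %d\n", DBL_DIG);
        printf("DBL_DECIMAL_DIG = %d\n", DBL_DECIMAL_DIG);
        printf("1.0/3.0 to 16 significant digits = %.16g\n", 1.0 / 3.0);
        return 0;
    }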
No, it is variable. The starting point is the very weak IEEE-754 standard; it only nailed down the format of floating point numbers as they are stored in memory. You can count on 7 digits of precision for single precision and 15 digits for double precision.
But a major flaw in that standard is that it does not specify how calculations are to be performed. And there's trouble: the Intel 8087 floating point processor in particular has caused programmers many sleepless nights. A significant design flaw in that chip is that it stores floating point values with more bits than the memory format, 80 bits instead of 32 or 64. The theory behind that design choice is that it allows intermediate calculations to be more precise and cause less round-off error.
Sounds like a good idea, but it did not turn out well in practice. A compiler writer will try to generate code that keeps intermediate values in the FPU as long as possible; that is important for code speed, since storing the value back to memory is expensive. Trouble is, values often must be stored back anyway: the number of registers in the FPU is limited and the code might cross a function boundary. At that point the value gets truncated back and loses a lot of precision. Small changes to the source code can now produce drastically different values, and the non-optimized build of a program produces different results from the optimized one, in a completely undiagnosable way; you'd have to look at the machine code to know why the result is different.
Intel redesigned their processor to solve this problem: the SSE instruction set calculates with the same number of bits as the memory format. It was slow to catch on, however, since redesigning the code generator and optimizer of a compiler is a significant investment. The big three C++ compilers have all switched. But, for example, the x86 jitter in the .NET Framework still generates FPU code, and it always will.
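A rough way to see (or rule out) the effect on your own toolchain is to force an intermediate back into a 64-bit memory slot and compare; whether the two results actually differ depends on whether the compiler emits x87 or SSE code:

    #include <stdio.h>

    int main(void)
    {
        double a = 1.0, b = 3.0;

        /* Kept in registers, the intermediate a/b may be held with
           80-bit x87 precision. */
        double kept = a / b * b - a;

        /* A volatile variable forces the intermediate into a 64-bit
           memory slot, discarding any extra x87 bits. */
        volatile double spilled = a / b;
        double stored = spilled * b - a;

        /* With x87 code generation these can differ; with SSE2 (the
           default on x86-64) they are normally identical. */
        printf("kept   = %.20g\n", kept);
        printf("stored = %.20g\n", stored);
        return 0;
    }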
Then there is systemic error, losing precision as an inevitable side effect of conversion and calculation. Conversion first: humans work with numbers in base 10, but the processor uses base 2. The nice round numbers we use, like 0.1, cannot be converted to nice round numbers on the processor. 0.1 is perfect as a sum of powers of 10, but there is no finite sum of powers of 2 that produces the same value. Converting it produces an infinite sequence of 1s and 0s, in the same manner that you can't perfectly write down 10 / 3. So it needs to be truncated to fit the processor, and that produces a value that's off by +/- 0.5 bit from the decimal value.
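The classic demonstration of that conversion error in C:

    #include <stdio.h>

    int main(void)
    {
        /* Each literal is rounded to the nearest binary value on
           conversion, so the sum does not land exactly on the (also
           rounded) 0.3. */
        double a = 0.1, b = 0.2, c = 0.3;
        printf("0.1 + 0.2          = %.17f\n", a + b);
        printf("0.3                = %.17f\n", c);
        printf("0.1 + 0.2 == 0.3 ?   %s\n", (a + b == c) ? "yes" : "no");
        return 0;
    }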
And calculation produces error. A multiplication or division doubles the number of bits in the result; rounding it to fit back into the stored value produces a +/- 0.5 bit error. Subtraction is the most dangerous operation and can cause the loss of a lot of significant digits. If you, say, calculate 1.234567f - 1.234566f, then the result has only 1 significant digit left. That's a junk result. Summing the differences between numbers that have nearly the same value is very common in numerical algorithms.
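That subtraction, written out so you can run it (the printed digits may differ slightly between compilers):

    #include <stdio.h>

    int main(void)
    {
        /* Both operands are stored with only ~7 significant digits; the
           subtraction cancels the six leading digits they share, leaving
           roughly one meaningful digit in the result. */
        float a = 1.234567f;
        float b = 1.234566f;
        printf("stored a = %.9f\n", a);    /* already rounded */
        printf("stored b = %.9f\n", b);
        printf("a - b    = %.9g  (exact answer: 1e-06)\n", a - b);
        return 0;
    }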
Getting excessive systemic errors is ultimately a flaw in the mathematical model. Just as an example, you never want to use Gaussian elimination, it is very unfriendly to precision; always consider an alternative approach, and LU decomposition is an excellent one. It is, however, not that common that a mathematician was involved in building the model and accounted for the expected precision of the result. A common book like Numerical Recipes also doesn't pay enough attention to precision, albeit that it indirectly steers you away from bad models by proposing better ones. In the end, a programmer often gets stuck with the problem. Well, if it was easy then anybody could do it and I'd be out of a good paying job :)