I keep getting mixed answers about whether floating point numbers (i.e. float, double, or long double) have one and only one value of precision, or have a precision value which can vary.
One topic called float vs. double precision seems to imply that floating point precision is an absolute.
However, another topic called Difference between float and double says,
In general a double has 15 to 16 decimal digits of precision
Another source says,
Variables of type float typically have a precision of about 7 significant digits
Variables of type double typically have a precision of about 16 significant digits
I don't like to refer to approximations like the above if I'm working with sensitive code that can break easily when my values are not exact. So let's set the record straight. Is floating point precision mutable or invariant, and why?
The amount of space required to store a float will be constant, and likewise a double; the amount of useful precision will in relative terms generally vary, however, between one part in 2^23 and one part in 2^24 for float, or between one part in 2^52 and one part in 2^53 for double. Precision very near zero isn't that good, with the second-smallest positive value being twice as big as the smallest, which will in turn be infinitely greater than zero. Throughout most of the range, however, precision will vary as described above.

Note that while it often isn't practical to have types whose relative precision varies by less than a factor of two throughout their range, the variation in precision can sometimes cause calculations to yield much less accurate results than it would appear they should. Consider, for example, 16777215.0f + 4.0f - 4.0f. All of the values would be precisely representable as float using the same scale, and the nearest values to the large one are +/- one part in 16,777,215, but the first addition yields a result in a part of the float range where values are separated by one part in only 8,388,610, causing the result to be rounded to 16,777,220. Consequently, subtracting 4 yields 16,777,216 rather than 16,777,215. For most values of float near 16777216, adding 4.0f and subtracting 4.0f would yield the original value unchanged, but the changing precision right at the break-over point causes the result to be off by an extra bit in the lowest place.
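A quick way to check that worked example (a hedged sketch: it assumes an IEEE-754 binary32 float, round-to-nearest-even, and no excess intermediate precision, so an x87 build may give a different result):

```cpp
#include <cstdio>

int main() {
    float x = 16777215.0f;        // exactly representable: just below 2^24
    float y = x + 4.0f - 4.0f;    // x + 4.0f rounds up to 16777220.0f; minus 4 gives 16777216.0f
    std::printf("x = %.1f\n", x); // 16777215.0
    std::printf("y = %.1f\n", y); // 16777216.0 under the assumptions above
    return 0;
}
```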
I'm going to add the off-beat answer here, and say that since you've tagged this question as C++, there is no guarantee whatsoever about the precision of floating point data. The vast majority of implementations use IEEE-754 when implementing their floating point types, but that is not required. The only thing the C++ language requires (C++ spec §3.9.1.8) is, roughly, that double provides at least as much precision as float, that long double provides at least as much precision as double, and that the value representation of floating-point types is implementation-defined.

Well, the answer to this is simple but complicated. These numbers are stored in binary. Depending on whether it is a float or a double, the computer uses a different number of binary digits to store the number. The precision you get depends on that binary representation. If you don't know how binary numbers work, it would be a good idea to look it up. But simply put, some numbers need more ones and zeros to represent than others.
So the precision is fixed (same number of binary digits), but the actual precision that you get depends on the numbers that you are using.
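For instance, a minimal sketch (assuming an IEEE-754 binary64 double): 0.5 has a short exact binary expansion, while 0.1 does not, so what gets stored for 0.1 is only the nearest representable value:

```cpp
#include <cstdio>

int main() {
    std::printf("%.20f\n", 0.5);  // 0.50000000000000000000 : 0.5 is exact in binary
    std::printf("%.20f\n", 0.1);  // 0.10000000000000000555... : nearest double to 0.1
    return 0;
}
```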
The storage has a precise digit count in binary, as other answers explain.
One thing to know: the CPU can run operations at a different precision internally, such as 80 bits. It means that code like the following can misbehave:
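(A hedged sketch; the function name is made up, and whether the assertion ever fails depends on the compiler, optimization level, and whether intermediates stay in 80-bit x87 registers.)

```cpp
#include <cassert>

// Hypothetical example: both sides of the comparison look identical, but if
// one evaluation of a*b*c is kept in an 80-bit register while the other is
// rounded back to 32 bits, the assert can fire.
void kaboom(float a, float b, float c) {
    assert(a * b * c == a * b * c);
}

int main() {
    kaboom(0.1f, 0.3f, 7.0f);   // arbitrary values; may or may not trigger on a given build
    return 0;
}
```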
This is more likely to happen with more complex computations.
Typically, given any numbers in the same power-of-2 range, the floating point precision is invariant - a fixed value. The absolute precision changes with each power-of-2 step. Over the entire FP range, the precision is approximately relative to the magnitude. Relating this relative binary precision to decimal precision incurs a wobble varying between DBL_DIG and DBL_DECIMAL_DIG decimal digits - typically 15 to 17.

What is precision? With FP, it makes most sense to discuss relative precision.
Floating point numbers have the form: sign * significand * pow(base, exponent).
They have a logarithmic distribution. There are about as many different floating point numbers between 100.0 and 3000.0 (a range of 30x) as there are between 2.0 and 60.0. This is true regardless of the underlying storage representation.
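A hedged way to check that count (it assumes IEEE-754 binary32 float, where the bit patterns of positive floats are ordered the same way as their values):

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

// For positive finite IEEE-754 floats, bit-pattern order matches value order,
// so the number of representable floats in [lo, hi) is the difference of the
// two bit patterns.
static std::uint32_t bits(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

int main() {
    std::printf("floats in [2, 60):     %u\n", bits(60.0f) - bits(2.0f));
    std::printf("floats in [100, 3000): %u\n", bits(3000.0f) - bits(100.0f));
    return 0;   // both counts come out around 41 million
}
```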
1.23456789e100 has about the same relative precision as 1.23456789e-100.

Most computers implement double as binary64. This format has 53 bits of binary precision.

The n numbers between 1.0 and 2.0 have the same absolute precision: adjacent values are (2.0-1.0)/pow(2,52) apart. The numbers between 64.0 and 128.0, also n of them, have the same absolute precision: adjacent values are (128.0-64.0)/pow(2,52) apart. Every group of numbers between consecutive powers of 2 has the same absolute precision.
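A small sketch of that spacing (assuming a binary64 double; the two gaps printed are pow(2,-52) and pow(2,-46)):

```cpp
#include <cmath>
#include <cstdio>

int main() {
    // Gap between adjacent doubles just above 1.0 and just above 64.0.
    std::printf("%g\n", std::nextafter(1.0, 2.0) - 1.0);      // ~2.22045e-16
    std::printf("%g\n", std::nextafter(64.0, 128.0) - 64.0);  // ~1.42109e-14
    return 0;
}
```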
Over the entire normal range of FP numbers, this approximates a uniform relative precision.
When these numbers are represented as decimal, the precision wobbles: numbers 1.0 to 2.0 have 1 more bit of absolute precision than numbers 2.0 to 4.0, 2 more bits than 4.0 to 8.0, etc.
C provides DBL_DIG, DBL_DECIMAL_DIG, and their float and long double counterparts. DBL_DIG indicates the minimum relative decimal precision. DBL_DECIMAL_DIG can be thought of as the maximum relative decimal precision. Typically this means a given double will have 15 to 17 decimal digits of precision.
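A minimal sketch that prints these constants (DBL_DECIMAL_DIG is only standard since C11/C++17, so older toolchains may not define it; on an IEEE-754 binary64 implementation the typical output is 15 and 17):

```cpp
#include <cfloat>
#include <cstdio>

int main() {
    std::printf("DBL_DIG         = %d\n", DBL_DIG);          // minimum decimal digits preserved
    std::printf("DBL_DECIMAL_DIG = %d\n", DBL_DECIMAL_DIG);  // digits needed to round-trip a double
    return 0;
}
```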
Consider 1.0 and its next representable double: the digits do not change until the 17th significant decimal digit. Each next double is pow(2,-52) or about 2.2204e-16 apart.
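A sketch of this, assuming a binary64 double; printed with 17 decimal places, 1.0 and the next double differ only at the 17th significant digit:

```cpp
#include <cmath>
#include <cstdio>

int main() {
    double next = std::nextafter(1.0, 2.0);  // smallest double greater than 1.0
    std::printf("%.17f\n", 1.0);             // 1.00000000000000000
    std::printf("%.17f\n", next);            // 1.00000000000000022
    return 0;
}
```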
"8.521812787393891"
and its next representable number as a decimal string using 16 significant decimal digits. Both of these strings, converted todouble
are the same8.521812787393891142073699...
even though they differ in the 16th digit. Saying thisdouble
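A hedged check of that claim; it assumes the next 16-digit decimal string is "8.521812787393892" and a binary64 double:

```cpp
#include <cstdio>
#include <cstdlib>

int main() {
    // Two adjacent 16-significant-digit decimal strings near 8.52...
    double a = std::strtod("8.521812787393891", nullptr);
    double b = std::strtod("8.521812787393892", nullptr);  // assumed "next" string
    // ...that convert to the same double on a binary64 implementation.
    std::printf("%.25g\n", a);
    std::printf("%.25g\n", b);
    std::printf("a == b: %s\n", a == b ? "true" : "false");
    return 0;
}
```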
All modern computers use binary floating-point arithmetic. That means we have a binary mantissa, which typically has 24 bits for single precision, 53 bits for double precision, and 64 bits for extended precision. (Extended precision is available on x86 processors, but not on ARM or possibly other types of processors.)
Mantissas of 24, 53, and 64 bits mean that for a floating-point number between 2^k and 2^(k+1), the gap to the next larger number is 2^(k-23), 2^(k-52), and 2^(k-63) respectively. That's the resolution. The rounding error of each floating-point operation is at most half of that.
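As a sketch (assuming IEEE-754 float and double), here is the resolution for a value in [2^3, 2^4), i.e. k = 3:

```cpp
#include <cmath>
#include <cstdio>

int main() {
    // Distance to the next larger representable value near 10.
    float  xf = 10.0f;
    double xd = 10.0;
    std::printf("float  resolution near 10: %g\n", std::nextafter(xf, 16.0f) - xf); // 2^(3-23), ~9.5e-7
    std::printf("double resolution near 10: %g\n", std::nextafter(xd, 16.0) - xd);  // 2^(3-52), ~1.8e-15
    return 0;
}
```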
So how does that translate into decimal numbers? It depends.
Take k = 0 and 1 ≤ x < 2. The resolution is 2^-23, 2^-52, and 2^-63, which is about 1.19×10^-7, 2.2×10^-16, and 1.08×10^-19 respectively. That's a bit less than 7, 16, and 19 decimals. Then take k = 3 and 8 ≤ x < 16. The difference between two floating-point numbers is now 8 times larger. For 8 ≤ x < 10 you get just over 6, less than 15, and just over 18 decimals respectively. But for 10 ≤ x < 16 you get one decimal more!

You get the highest number of decimal digits if x is only a bit less than 2^(k+1) and only a bit more than 10^n, for example 1000 ≤ x < 1024. You get the lowest number of decimal digits if x is just a bit higher than 2^k and a bit less than 10^n, for example 1⁄1024 ≤ x < 1⁄1000. The same binary precision can produce decimal precision that varies by up to 1.3 digits, or log10(2×10).
Of course, you could just read the article "What every computer scientist should know about floating-point arithmetic."