Is floating point precision mutable or invariant?

Posted 2019-02-02 06:31

I keep getting mixed answers as to whether floating point numbers (i.e. float, double, or long double) have one and only one value of precision, or have a precision value which can vary.

One topic called float vs. double precision seems to imply that floating point precision is an absolute.

However, another topic called Difference between float and double says,

In general a double has 15 to 16 decimal digits of precision

Another source says,

Variables of type float typically have a precision of about 7 significant digits

Variables of type double typically have a precision of about 16 significant digits

I don't like to refer to approximations like the above if I'm working with sensitive code that can break easily when my values are not exact. So let's set the record straight. Is floating point precision mutable or invariant, and why?

10 answers
太酷不给撩
#2 · 2019-02-02 06:51

The amount of space required to store a float is constant, and likewise for a double. The amount of useful precision, however, generally varies in relative terms between one part in 2^23 and one part in 2^24 for float, or between one part in 2^52 and one part in 2^53 for double. Precision very near zero isn't that good: the second-smallest positive value is twice as big as the smallest, which is in turn infinitely greater than zero. Throughout most of the range, however, precision varies as described above.

Note that while it often isn't practical to have types whose relative precision varies by less than a factor of two throughout their range, the variation in precision can sometimes cause calculations to yield much less accurate results than it would appear they should. Consider, for example, 16777215.0f + 4.0f - 4.0f. All of the values are precisely representable as float using the same scale, and the nearest values to the large one are +/- one part in 16,777,215, but the first addition yields a result (16,777,219) in a part of the float range where values are separated by one part in only 8,388,610, causing it to be rounded to 16,777,220. Consequently, subtracting 4 yields 16,777,216 rather than 16,777,215. For most values of float near 16777216, adding 4.0f and subtracting 4.0f would yield the original value unchanged, but the changing precision right at the break-over point causes the result to be off by an extra bit in the lowest place.
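The arithmetic above can be reproduced with a small sketch. It assumes float is IEEE-754 binary32 and the default round-to-nearest, ties-to-even mode is in effect; the function names are made up for illustration.

```cpp
#include <cstdio>

// Adds 4.0f and then subtracts 4.0f, one step at a time. The volatile
// intermediates force each result to be rounded to float, even on
// platforms that evaluate float expressions in a wider format.
float add_then_subtract_four(float start) {
    volatile float sum = start + 4.0f;  // may land where floats are 2 apart
    volatile float back = sum - 4.0f;
    return back;
}

void show_breakover() {
    // 16777215 + 4 = 16777219 is not a float; it rounds to 16777220.
    std::printf("%.1f\n", add_then_subtract_four(16777215.0f));  // 16777216.0
    // 16777214 + 4 = 16777218 is a float, so nothing is lost.
    std::printf("%.1f\n", add_then_subtract_four(16777214.0f));  // 16777214.0
}
```

Calling show_breakover() should print the two values noted in the comments on such a platform.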

对你真心纯属浪费
#3 · 2019-02-02 06:52

I'm going to add the off-beat answer here, and say that since you've tagged this question as C++, there is no guarantee whatsoever about precision of floating point data. The vast majority of implementations use IEEE-754 when implementing their floating point types, but that is not required. The only thing required by the C++ language is that (C++ spec §3.9.1.8):

There are three floating point types: float, double, and long double. The type double provides at least as much precision as float, and the type long double provides at least as much precision as double. The set of values of the type float is a subset of the set of values of the type double; the set of values of the type double is a subset of the set of values of the type long double. The value representation of floating-point types is implementation-defined. Integral and floating types are collectively called arithmetic types. Specializations of the standard template std::numeric_limits (18.3) shall specify the maximum and minimum values of each arithmetic type for an implementation.
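Since the representation is implementation-defined, the practical move is to ask the implementation. A minimal sketch using std::numeric_limits follows; the helper names report and report_all are invented here, and the numbers printed depend entirely on your compiler and target.

```cpp
#include <cstdio>
#include <limits>

// Prints what this implementation reports for one floating-point type:
// the radix, the number of radix digits in the significand, the decimal
// digits that survive a text -> value -> text trip (digits10), the decimal
// digits needed for a value -> text -> value trip (max_digits10), and
// whether the type claims IEC 559 (IEEE-754) conformance.
template <typename T>
void report(const char* name) {
    typedef std::numeric_limits<T> lim;
    std::printf("%-11s radix=%d digits=%d digits10=%d max_digits10=%d iec559=%d\n",
                name, lim::radix, lim::digits, lim::digits10,
                lim::max_digits10, static_cast<int>(lim::is_iec559));
}

void report_all() {
    report<float>("float");
    report<double>("double");
    report<long double>("long double");
}
```

On a typical IEEE-754 platform, float and double report 24 and 53 binary digits, but nothing in the language requires those values.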
迷人小祖宗
#4 · 2019-02-02 06:53

Well, the answer to this is simple but complicated. These numbers are stored in binary. Depending on whether it is a float or a double, the computer uses a different number of bits to store the number. The precision that you get depends on those bits. If you don't know how binary numbers work, it would be a good idea to look it up. But simply put, some numbers need more ones and zeros than other numbers, and some need more than the format has room for.

So the precision is fixed (same number of binary digits), but the actual precision that you get depends on the numbers that you are using.
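As a rough illustration, assuming double is IEEE-754 binary64: 0.125 is 2^-3 and fits in the available bits exactly, while 0.1 would need infinitely many binary digits and is stored slightly off. The helper names below are made up for this sketch.

```cpp
#include <cstdio>

// Adds `term` to a running total `count` times.
double sum_repeated(double term, int count) {
    double total = 0.0;
    for (int i = 0; i < count; ++i) {
        total += term;
    }
    return total;
}

void show_binary_fractions() {
    // 0.125 is a power of two, so eight of them add up to exactly 1.
    std::printf("8 x 0.125 = %.17g\n", sum_repeated(0.125, 8));
    // 0.1 is only approximated, and the rounding errors do not cancel.
    std::printf("10 x 0.1  = %.17g\n", sum_repeated(0.1, 10));
}
```

Both sums use the same number of bits; only the second one starts from a value that had to be rounded.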

Explosion°爆炸
#5 · 2019-02-02 06:55

The storage has a precise digit count in binary, as other answers explain.

One thing to know: the CPU can run operations at a different precision internally, such as 80 bits. This means that code like the following can throw:

void Kaboom( float a, float b, float c ) // the same is true for other floating-point types
{
    float sum1 = a+b+c; // may be computed entirely at the wider internal precision
    float sum2 = a+b;
    sum2 += c; // assume the compiler did not keep sum2 in a register: the value was written to memory as a float, then loaded again
    if (sum1 != sum2)
        throw "kaboom"; // this can happen
}

It is more likely with more complex computation.
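Whether an implementation evaluates intermediate results in a wider format can be queried through the FLT_EVAL_METHOD macro from <cfloat>. The sketch below only reports it; the function names are invented, and the macro describes what the implementation claims, which may not cover every optimization setting.

```cpp
#include <cfloat>
#include <cstdio>

// Describes a FLT_EVAL_METHOD value as defined by the C and C++ standards.
const char* describe_eval_method(int method) {
    switch (method) {
        case 0:  return "each operation is evaluated in its own type";
        case 1:  return "float and double operations are evaluated as double";
        case 2:  return "all operations are evaluated as long double";
        default: return "indeterminable or implementation-defined";
    }
}

void show_eval_method() {
    const int method = static_cast<int>(FLT_EVAL_METHOD);
    std::printf("FLT_EVAL_METHOD = %d: %s\n", method, describe_eval_method(method));
}
```

A value of 0 means a function like Kaboom should not see extra internal precision; 2 is the classic x87 situation this answer describes.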

戒情不戒烟
#6 · 2019-02-02 06:59

Is floating point precision mutable or invariant, and why?

Typically, for any numbers in the same power-of-2 range, the floating point precision is invariant: a fixed value. The absolute precision changes with each power-of-2 step. Over the entire FP range, the precision is approximately relative to the magnitude. Expressing this relative binary precision as a decimal precision incurs a wobble varying between DBL_DIG and DBL_DECIMAL_DIG decimal digits, typically 15 to 17.


What is precision? With FP, it makes most sense to discuss relative precision.

Floating point numbers have the form of:

Sign * Significand * pow(base,exponent)

They have a logarithmic distribution. There are about as many different floating point numbers between 100.0 and 3000.0 (a range of 30x) as there are between 2.0 and 60.0. This is true regardless of the underlying storage representation.

1.23456789e100 has about the same relative precision as 1.23456789e-100.
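The "about as many" claim can be checked by counting. This sketch assumes double is IEEE-754 binary64, where consecutive positive doubles have consecutive bit patterns; bits_of and doubles_between are names made up for the example.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

static_assert(sizeof(double) == sizeof(std::uint64_t), "sketch assumes a 64-bit double");

// Returns the bit pattern of a double as an unsigned 64-bit integer.
std::uint64_t bits_of(double value) {
    std::uint64_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    return bits;
}

// Counts the doubles in [low, high), for 0 < low <= high.
std::uint64_t doubles_between(double low, double high) {
    return bits_of(high) - bits_of(low);
}

void show_counts() {
    std::printf("[100, 3000): %llu\n",
                static_cast<unsigned long long>(doubles_between(100.0, 3000.0)));
    std::printf("[2, 60):     %llu\n",
                static_cast<unsigned long long>(doubles_between(2.0, 60.0)));
}
```

Under that assumption the two counts differ by well under one percent, even though the first interval is fifty times wider in absolute terms.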


Most computers implement double as binary64. This format has 53 bits of binary precision.

The n numbers between 1.0 and 2.0 all have the same absolute precision: a spacing of (2.0-1.0)/pow(2,52).
Numbers between 64.0 and 128.0, also n of them, all have the same absolute precision: a spacing of (128.0-64.0)/pow(2,52).

Every group of numbers between consecutive powers of 2 has the same absolute precision.

Over the entire normal range of FP numbers, this approximates a uniform relative precision.
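The spacing within each power-of-2 range can be measured with std::nextafter. A minimal sketch, assuming binary64 doubles; spacing_above is a name chosen for illustration.

```cpp
#include <cmath>
#include <cstdio>
#include <limits>

// Gap between x and the next representable double above it.
double spacing_above(double x) {
    return std::nextafter(x, std::numeric_limits<double>::infinity()) - x;
}

void show_spacing() {
    const double samples[] = {1.0, 1.5, 64.0, 100.0};
    for (double x : samples) {
        // The absolute gap is constant within [1, 2) and within [64, 128),
        // while the relative gap stays between 2^-53 and 2^-52.
        std::printf("x=%-6g gap=%.6e relative=%.6e\n",
                    x, spacing_above(x), spacing_above(x) / x);
    }
}
```

Here 1.0 and 1.5 share one gap, and 64.0 and 100.0 share another that is 64 times larger.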

When these numbers are represented as decimal, the precision wobbles: numbers from 1.0 to 2.0 have 1 more bit of absolute precision than numbers from 2.0 to 4.0, 2 more bits than numbers from 4.0 to 8.0, etc.

C provides DBL_DIG, DBL_DECIMAL_DIG, and their float and long double counterparts. DBL_DIG indicates the minimum relative decimal precision. DBL_DECIMAL_DIG can be thought of as the maximum relative decimal precision.

Typically this means a given double will have 15 to 17 decimal digits of precision.
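One way to see the two limits at work is a round trip through text. The sketch below uses the C++ counterparts std::numeric_limits<double>::digits10 and max_digits10 in place of the C macros, and assumes binary64 doubles; round_trips is a hypothetical helper.

```cpp
#include <cstdio>
#include <cstdlib>
#include <limits>

// Prints `value` with the given number of significant decimal digits,
// parses the text back, and reports whether the same double came out.
bool round_trips(double value, int significant_digits) {
    char text[64];
    std::snprintf(text, sizeof text, "%.*e", significant_digits - 1, value);
    return std::strtod(text, nullptr) == value;
}

void show_round_trip() {
    const double value = 0.1 + 0.2;  // not the same double as 0.3
    const int low = std::numeric_limits<double>::digits10;       // like DBL_DIG
    const int high = std::numeric_limits<double>::max_digits10;  // like DBL_DECIMAL_DIG
    std::printf("%d digits round-trip: %d\n", low, round_trips(value, low));
    std::printf("%d digits round-trip: %d\n", high, round_trips(value, high));
}
```

Printing with max_digits10 digits is enough to recover any double; printing with only digits10 digits can lose the last bit or two, as it does for this value.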

Consider 1.0 and its next representable double: the digits do not change until the 17th significant decimal digit. Doubles in this range are pow(2,-52), or about 2.2204e-16, apart.

/*
1 234567890123456789 */
1.000000000000000000...
1.000000000000000222...

Now consider "8.521812787393891" and its next representable number as a decimal string using 16 significant decimal digits, "8.521812787393892". Both of these strings, converted to double, give the same value, 8.521812787393891142073699..., even though they differ in the 16th digit. Saying this double had 16 digits of precision would be overstating it.

/*
1 234567890123456789 */
8.521812787393891
8.521812787393891142073699...
8.521812787393892
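The collision between those two strings can be checked directly. A small sketch, assuming binary64 doubles and a correctly rounding std::strtod; same_double is a name invented for the example.

```cpp
#include <cstdio>
#include <cstdlib>

// True when two decimal strings convert to the same double.
bool same_double(const char* first, const char* second) {
    return std::strtod(first, nullptr) == std::strtod(second, nullptr);
}

void show_collision() {
    // Doubles near 8.5 are about 1.8e-15 apart, wider than the 1e-15 step
    // between 16-digit decimal strings, so some neighbours must collide.
    std::printf("...891 vs ...892 same double: %d\n",
                same_double("8.521812787393891", "8.521812787393892"));
    std::printf("...890 vs ...891 same double: %d\n",
                same_double("8.521812787393890", "8.521812787393891"));
    std::printf("stored value: %.25g\n", std::strtod("8.521812787393891", nullptr));
}
```

The first pair collides while the second does not, which is why 16 digits cannot be promised across the whole range.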
不美不萌又怎样
#7 · 2019-02-02 07:00

Virtually all modern computers use binary floating-point arithmetic. That means we have a binary mantissa, which typically has 24 bits for single precision, 53 bits for double precision, and 64 bits for extended precision. (Extended precision is available on x86 processors, but not on ARM or possibly other types of processors.)

24-, 53-, and 64-bit mantissas mean that for a floating-point number between 2^k and 2^(k+1), the gap to the next larger number is 2^(k-23), 2^(k-52), and 2^(k-63) respectively. That's the resolution. The rounding error of each floating-point operation is at most half of that.

So how does that translate into decimal numbers? It depends.

Take k = 0 and 1 ≤ x < 2. The resolution is 2^-23, 2^-52, and 2^-63, which is about 1.19×10^-7, 2.2×10^-16, and 1.08×10^-19 respectively. That's a bit less than 7, 16, and 19 decimals. Then take k = 3 and 8 ≤ x < 16. The difference between two floating-point numbers is now 8 times larger. For 8 ≤ x < 10 you get just over 6, less than 15, and just over 18 decimals respectively. But for 10 ≤ x < 16 you get one decimal more!

You get the highest number of decimal digits if x is only a bit less than 2^(k+1) and only a bit more than 10^n, for example 1000 ≤ x < 1024. You get the lowest number of decimal digits if x is just a bit higher than 2^k and a bit less than 10^n, for example 1/1024 ≤ x < 1/1000. The same binary precision can produce decimal precision that varies by up to 1.3 digits, or log10(2×10).
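The numbers in this answer can be approximated with a short sketch. It rests on one possible reading of "decimals": the position of the leading decimal digit of x minus log10 of the resolution. That reading, and the names resolution and decimal_digits, are assumptions made for illustration rather than a standard definition.

```cpp
#include <cmath>
#include <cstdio>
#include <limits>

// Gap to the next larger value for a positive x in [2^k, 2^(k+1)):
// 2^(k - (p - 1)), where p is the number of mantissa bits of T.
template <typename T>
T resolution(T x) {
    const int k = std::ilogb(x);
    return std::ldexp(T(1), k - (std::numeric_limits<T>::digits - 1));
}

// Assumed measure: floor(log10(x)) - log10(resolution(x)).
template <typename T>
double decimal_digits(T x) {
    const double leading = std::floor(std::log10(static_cast<double>(x)));
    return leading - std::log10(static_cast<double>(resolution(x)));
}

void show_decimal_digits() {
    const double samples[] = {1.0, 9.0, 12.0, 1010.0, 0.00098};
    for (double x : samples) {
        std::printf("x=%-8g float=%.2f double=%.2f\n",
                    x, decimal_digits(static_cast<float>(x)), decimal_digits(x));
    }
}
```

Under this measure a double just above 1000 comes out roughly 1.3 decimal digits ahead of one just below 1/1000, in line with the figure quoted above.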

Of course, you could just read the article "What every computer scientist should know about floating-point arithmetic."
