What types of numbers are representable in binary?

Published 2019-01-08 13:58

Question:

I've read a lot about floats, but it's all unnecessarily involved. I think I've pretty much got it, but there's just one thing I'd like to know for sure:

I know that fractions of the form 1/pow(2,n), with n an integer, can be represented exactly in floating-point numbers. This means that if I add 1/32 to itself 32 million times, I would get exactly 1,000,000.
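For instance, I'd expect a loop like this (a quick C# sketch) to print exactly 1000000:

  double total = 0;
  for (int i = 0; i < 32_000_000; i++)
      total += 1.0 / 32;        // every partial sum k/32 is exactly representable
  Console.WriteLine(total);     // 1000000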

What about something like 1/(32+16)? It's one over the sum of two powers of two; does that work? Or is it 1/32 + 1/16 that works? This is where I'm confused, so if anyone could clarify it for me, I'd appreciate it.

Answer 1:

The rule can be summed up as this:

  • A number can be represented exactly in binary if, once the fraction is in lowest terms, the prime factorization of the denominator contains only 2s (i.e., the denominator is a power of two).

So 1/(32 + 16) = 1/48 is not representable in binary because its denominator has a factor of 3. But 1/32 + 1/16 = 3/32 is.
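A quick check in C# (the printed digits assume IEEE 754 doubles):

  // 3/32 has only 2s in its denominator, so both expressions below
  // produce the same exact double:
  Console.WriteLine(1.0 / 32 + 1.0 / 16 == 3.0 / 32);   // True
  // 1/48 is not exact; round-trip formatting exposes the rounded tail:
  Console.WriteLine((1.0 / 48).ToString("G17"));        // 0.020833333333333332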

That said, there are further restrictions on what a particular floating-point type can represent. For example, you only have 53 bits of mantissa in an IEEE double, so 1/2 + 1/2^500 is not representable.

So you can sum powers of two as long as the exponents involved span no more than 53 bit positions, i.e., the highest and lowest exponents differ by at most 52.
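You can see that 53-bit window at work in C# (a small sketch):

  // Exponents -1 and -500 are too far apart for one 53-bit significand,
  // so the tiny term is rounded away entirely:
  Console.WriteLine(0.5 + Math.Pow(2, -500) == 0.5);   // True: 2^-500 vanished
  // Exponents -1 and -53 span exactly 53 bit positions, which just fits:
  Console.WriteLine(0.5 + Math.Pow(2, -53) == 0.5);    // False: the sum is exact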


To generalize this to other bases:

  • A number can be exactly represented in base 10 if the prime factorization of the denominator consists of only 2's and 5's.

  • A rational number X can be exactly represented in base N if the prime factorization of the denominator of X contains only primes found in the factorization of N.
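This rule is easy to test mechanically. Here is a sketch in C# (the function names are mine, not from any library): reduce the fraction to lowest terms, then repeatedly strip from the denominator every prime factor it shares with the base; the expansion terminates exactly when nothing is left.

  static bool TerminatesInBase(long numerator, long denominator, long baseN)
  {
      long g = Gcd(Math.Abs(numerator), denominator);
      denominator /= g;                          // reduce to lowest terms first
      while (true)
      {
          long d = Gcd(denominator, baseN);
          if (d == 1) return denominator == 1;   // no shared prime factors remain
          while (denominator % d == 0)
              denominator /= d;                  // strip the shared factors
      }
  }

  static long Gcd(long a, long b) => b == 0 ? a : Gcd(b, a % b);

For example, TerminatesInBase(1, 48, 2) is false, TerminatesInBase(3, 32, 2) is true, and TerminatesInBase(1, 20, 10) is true because 20 = 2 * 2 * 5.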



Answer 2:

A finite number can be represented in the common IEEE 754 double-precision format if and only if it equals M • 2^e for some integers M and e such that -2^53 < M < 2^53 and -1074 ≤ e ≤ 971.

For single precision, -2^24 < M < 2^24 and -149 ≤ e ≤ 104.

For double precision, these are consequences of the facts that the format uses 52 bits to store a significand (which normally has 53 bits due to an implicit leading 1) and 11 bits to store an exponent. The 11 bits encode numbers from 0 to 2047, but 0 and 2047 are reserved for special purposes, and the encoded number is biased by 1023, so it represents unbiased exponents from -1022 to 1023. However, those unbiased exponents apply to significands in the interval [1, 2), which have fraction bits. To express the significand as an integer, I adjusted the exponent range by 52. Single precision is similar, with 23 bits storing a 24-bit significand, 8 bits for the exponent, and a bias of 127.

Expressing the representable numbers using an integer times a power of two rather than the more common fractional significand simplifies some number theory and other reasoning about floating-point properties. I used it in this answer because it allows the set of representable values to be expressed concisely.
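Those integers can be recovered directly from the bits. A sketch in C# (my own helper, not a library call), using the binary64 layout of 1 sign bit, 11 exponent bits, and 52 fraction bits:

  // Prints M and e with value == M * 2^e (finite inputs only;
  // an exponent field of 2047 would mean infinity or NaN).
  static void Decompose(double value)
  {
      long bits = BitConverter.DoubleToInt64Bits(value);
      long fraction = bits & 0xFFFFFFFFFFFFFL;       // low 52 bits
      int biasedExp = (int)((bits >> 52) & 0x7FF);   // 11 exponent bits
      int sign = bits < 0 ? -1 : 1;
      long m; int e;
      if (biasedExp == 0) { m = fraction; e = -1074; }            // subnormal: no implicit 1
      else { m = fraction | (1L << 52); e = biasedExp - 1075; }   // normal: 1075 = 1023 + 52
      Console.WriteLine($"{value} = {sign * m} * 2^{e}");
  }

For example, Decompose(0.09375) reports 0.09375 = 6755399441055744 * 2^-56, which is 3 * 2^51 * 2^-56 = 3/32.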



Answer 3:

Normal floating-point numbers are represented in the form:

1.m * 2^e

where 1.m is a binary significand (an implicit leading 1 followed by the fraction bits m) and e is an integer, positive or negative.

As such, you can represent 1/32 + 1/16 exactly, as:

1.1000000 * 2^-4

(1.10 being the binary fraction equivalent to 1.5.) 1/48, however, is not representable in this format.
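Checking that decomposition in C#:

  // 3/32 = 1.5 * 2^-4, and both sides are the same exact double:
  Console.WriteLine(1.5 * Math.Pow(2, -4) == 1.0 / 32 + 1.0 / 16);   // True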



Answer 4:

One point not yet mentioned: semantically, a floating-point number may be best regarded as representing a range of values. That range has a very precisely defined center point, and the IEEE spec generally requires that the result of a floating-point computation be the number whose range contains the point one would get by operating on the center points of the original numbers. But consider the sequence:

  double N1 = 0.1;        // nearest double to 0.1; not exactly one tenth
  float  N2 = (float)N1;  // nearest float to N1's value
  double N3 = N2;         // widens exactly, but to the float's center point

N2 is the unambiguous correct single-precision representation of the value that had been represented in N1, despite the language's silly requirement to use an explicit cast. N3 will represent one of the values that N2 could represent (the language spec happens to choose the double value whose range is centered upon the middle of the range of the float). Note that while N2 represents the value of its type whose range contains the correct value, N3 does not.
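Printing N1 and N3 with round-trip precision makes the drift visible (digits assume standard IEEE behavior):

  Console.WriteLine(N1.ToString("G17"));   // 0.10000000000000001
  Console.WriteLine(N3.ToString("G17"));   // 0.10000000149011612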

Incidentally, conversion of a number from a string to a float in .NET languages seems to go through an intermediate conversion to double, which may sometimes alter the value. For example, even though the value 13571357 is representable as a single-precision float, the string "13571357.499999999069" parses to 13571358f (even though its value is obviously closer to 13571357).
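One way to observe that double-rounding (a hedged sketch; exact behavior may differ across .NET versions):

  float parsed = float.Parse("13571357.499999999069");
  Console.WriteLine(parsed == 13571357f);   // reportedly False: the parse detoured through double
  Console.WriteLine(parsed == 13571358f);   // reportedly True, per the answer above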