C++ I've just read that floats are inexact and

Posted 2019-05-23 19:04

Question:

I am thinking of this at a binary level.

Would a float of value 1 and an integer of value 1 not both compile down to (omitting lots of zeros here)

0001

If they do both compile down to this, then where does this inexactness come in?

Resource I'm using is http://www.cprogramming.com/tutorial/lesson1.html

Thanks.

Answer 1:

It's possible. Floating point numbers are represented in an exponential notation (a*2^n), where some bits represent a (the significand), and some bits represent n (the exponent).

You can't uniquely represent all the integers in the range of a floating point value, due to the so-called pigeonhole principle. For example, 32-bit floats go up to over 10^38, but on 32 bits you can only represent 2^32 values - that means some integers will have the same representation.

Now, what happens when you try to, for example, do the following:

x = 10^38 - (10^38 - 1)

You should get 1, but you probably won't, because 10^38 and 10^38-1 are so close to each other that the computer has to represent them the same way. So, your 1.0f will usually be 1, but if this 1 is a result of calculation, it might not be.
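To make the subtraction above concrete, here is a minimal C++ sketch. The function name cancel_one is made up for illustration, and it assumes float is an IEEE 754 single-precision type, as it is on mainstream platforms.

```cpp
// Computes big - (big - 1) in single precision.
// Algebraically the result is always 1.
float cancel_one(float big) {
    // For a huge `big`, 1 is far below the spacing between neighbouring
    // floats, so the subtraction leaves `big` unchanged.
    float almost_big = big - 1.0f;
    return big - almost_big;
}
```

Under that assumption, cancel_one(1e38f) comes out as 0 rather than 1, while a small argument such as cancel_one(8.0f) still gives exactly 1.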

Here are some examples.



Answer 2:

To be precise: Integers can be exactly represented as floats if their binary representation does not use more bits than the float format supplies for the mantissa plus an implicit one bit.

IEEE floats have a mantissa of 23 bits, add one implicit bit, and you can store any integer representable with 24 bits in a float (that's integers up to 16777216). Likewise, a double has 52 mantissa bits, so it can store integers up to 9007199254740992.

Beyond that point, the IEEE format omits first the odd numbers, then all numbers not divisible by 4, and so on. So, even 0xffffff00ul is exactly representable as a float, but 0xffffff01ul is not.

So, yes, you can represent integers as floats, and as long as they don't become larger than the 16e6 or 9e15 limits, you can even expect additions between integers in float format to be exact.
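A quick way to check these limits yourself is a round trip through float. This is only a sketch: fits_in_float is a hypothetical helper, and it assumes a 32-bit IEEE 754 float with round-to-nearest conversions.

```cpp
#include <cstdint>

// Returns true if n survives a round trip through float unchanged,
// i.e. if n is exactly representable as a float.
bool fits_in_float(std::uint32_t n) {
    float f = static_cast<float>(n);  // rounds if n needs more than 24 bits
    // Convert back through a 64-bit type so that a value rounded up
    // to 2^32 still fits in the integer.
    return static_cast<std::uint64_t>(f) == n;
}
```

With those assumptions, 16777216 and 0xffffff00 pass the check, while 16777217 and 0xffffff01 do not.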



Answer 3:

Short answer: no, the floating point representation of integers is not that simple.


The representation used for the float type by virtually every C and C++ implementation is called IEEE 754 single-precision (the language standards themselves do not strictly require it). It is probably more complicated than most people would like to delve into, but the link describes it thoroughly in case you're interested.

As for the representation of the integer 1: we can see how it's encoded in the 32-bit base-2 single-precision format defined by IEEE 754 here - 3f80 0000.



Answer 4:

A float will store an int exactly if the int is less than a certain number, but if you have a large enough int, there won't be enough bits in the mantissa to store all the bits of the integer. The low-order bits that don't fit are rounded away. Unless those bits were already zero, your int won't be equal to your float.



Answer 5:

Suppose each letter stands for one bit, 0 or 1. Then a floating point number looks (schematically) like:

smmmmee

where s is the sign +/- and the number is .mmmm x 10 ^ ee

Now if you have two immediately following numbers:

.mmm0 x 10 ^ ee
.mmm1 x 10 ^ ee

Then for a large exponent ee the difference might be more than 1.

And of course in base 2 a number like 1/5 (0.2) cannot be represented exactly. Summing such fractions will increase the error.

(Note this is not the exact representation.)
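The growing distance between neighbouring values can be measured directly with std::nextafter from <cmath>. A small sketch, where gap_above is a made-up name and float is assumed to be IEEE 754 single precision:

```cpp
#include <cmath>
#include <limits>

// Distance from x to the next representable float above it.
float gap_above(float x) {
    float next = std::nextafter(x, std::numeric_limits<float>::infinity());
    return next - x;  // exact: the difference of two adjacent floats
}
```

Near 1.0f the gap is the machine epsilon (2^-23), at 16777216.0f it is already 2, and around 1e38f it is far larger than 1.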



Answer 6:

would a float of value 1 and an integer of value 1 not compile down to (omitting lots of zeros here) 0001

No, a float will be stored as something like 0x3f800000 (the bytes 00 00 80 3f in memory on a little-endian machine), depending on precision.

What does this mean?

  1. Some numbers cannot be precisely represented in binary form. 0.2 in binary form looks like 0.00110011001100110011... which keeps going (and repeating) forever. No matter how many bits you use to store it, it will never be enough. That's because its denominator, 5, is not a power of 2. The only way to represent it precisely is to store it as a ratio.
  2. Floating-point numbers have limited precision. Roughly speaking, they only store a certain number of digits after the first significant non-zero digit, and the rest are lost. That results in errors: for example, with single-precision floats, 100000000000000001 and 100000000000000002 are rounded to the same number.
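Both points can be checked with a few lines of C++. The helper names below are invented for this sketch, and the results assume IEEE 754 float and double.

```cpp
// Point 1: the float nearest to 0.2 and the double nearest to 0.2 are
// different numbers, because neither type can hold 0.2 exactly.
bool fifth_is_inexact() {
    return static_cast<double>(0.2f) != 0.2;
}

// Point 2: returns true if two integers collapse to the same float.
bool same_float(long long a, long long b) {
    return static_cast<float>(a) == static_cast<float>(b);
}
```

Under those assumptions, same_float(100000000000000001LL, 100000000000000002LL) is true, whereas small neighbours such as 1 and 2 stay distinct.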

You might also want to read something like this.

Conclusion:

If you're writing financial software, do not use floats. Use bignums, via a library such as GMP.



Answer 7:

Unlike some dynamically typed programming languages such as JavaScript, whose basic Number type is a double, the C programming language has many numeric types. That is because C reflects the different ways in which different kinds of numbers are represented within a processor register.

To investigate different representations you can use the union construct where the same data can be viewed as different types.

Define

union {
  float x;
  int v;
} u;

Assign u.x = 1.0f and call printf("0x%08x\n", u.v) to print the 32-bit representation of 1.0f as a floating point number. It should print 0x3f800000 and not 0x00000001 as one might expect.
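Since the question is tagged C++: reading a union member other than the one last written is fine in C but undefined behaviour in C++. A portable alternative is to copy the bytes with std::memcpy. The sketch below uses a made-up helper name, float_bits, and assumes a 32-bit IEEE 754 float.

```cpp
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t),
              "this sketch assumes a 32-bit float");

// Returns the raw bit pattern of f as a 32-bit unsigned integer.
std::uint32_t float_bits(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);  // copies the bytes, no conversion
    return bits;
}
```

With that assumption, float_bits(1.0f) is 0x3f800000 and float_bits(0.1f) is 0x3dcccccd.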

As mentioned in earlier answers, this reflects the layout of a floating-point number as a 32-bit value:

    1.0f = 0x3F800000 = 0011.1111.1000.0000.0000.0000.0000.0000 =
                 0 0111.1111 000.0000.0000.0000.0000.0000 = 0 0x7F 0

Here the three parts are sign s=0, exponent e=127, and mantissa m=0 and the floating point value is computed as

   value = (-1)^s * (1 + m * 2^-23) * 2^(e-127)
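To see the three fields at work, here is a sketch that splits a 32-bit pattern and rebuilds the value as (-1)^s * (1 + m * 2^-23) * 2^(e-127). The struct and function names are made up, and the reconstruction only covers normal numbers (exponent field 1 to 254). Zeros, denormals, infinities and NaNs follow different rules and are left out.

```cpp
#include <cmath>
#include <cstdint>

// The three fields of an IEEE 754 single-precision bit pattern.
struct FloatFields {
    std::uint32_t sign;      // 1 bit
    std::uint32_t exponent;  // 8 bits, biased by 127
    std::uint32_t mantissa;  // 23 stored bits
};

FloatFields split_fields(std::uint32_t bits) {
    return FloatFields{bits >> 31, (bits >> 23) & 0xFFu, bits & 0x7FFFFFu};
}

// Rebuilds the value of a normal number from its fields.
double field_value(const FloatFields& f) {
    // 1 + m * 2^-23: the implicit leading one plus the stored fraction
    double significand = 1.0 + std::ldexp(static_cast<double>(f.mantissa), -23);
    double magnitude = std::ldexp(significand, static_cast<int>(f.exponent) - 127);
    return f.sign ? -magnitude : magnitude;
}
```

For 0x3f800000 the fields are s=0, e=127, m=0 and the rebuilt value is 1.0. For 0x3dcccccd the result is the same number you get by converting 0.1f to double.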

With this representation any integer number from -16,777,216 to 16,777,216 can be represented exactly. This bound is 2^24, because the 23 stored mantissa bits plus the implicit leading 1 give 24 significant bits. This range is not sufficient for many applications, therefore the float type cannot replace the int type.

The range of exact representation of integers by the double type is wider, since the value occupies 64 bits, of which 52 store the mantissa (53 significant bits with the implicit leading 1). It runs from -9,007,199,254,740,992 to 9,007,199,254,740,992, that is, up to 2^53. Yet double requires twice as much memory.
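These bounds do not have to be memorised. In C++ the number of significand bits (including the implicit leading one) is available as std::numeric_limits<T>::digits, and every integer up to 2^digits is exact. A small sketch with a made-up function name, meant for float and double only (for wider types the shift would overflow):

```cpp
#include <limits>

// Largest N such that every integer in [-N, N] is exactly representable
// in T. Intended for float and double only.
template <typename T>
long long exact_integer_limit() {
    // digits counts significand bits, including the implicit leading one
    return 1LL << std::numeric_limits<T>::digits;
}
```

On an IEEE 754 platform exact_integer_limit<float>() gives 16777216 and exact_integer_limit<double>() gives 9007199254740992.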

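Another source of difficulty is the way fractional numbers are represented. Since most decimal fractions cannot be represented exactly in binary (0.1f = 0x3dcccccd = 0.10000000149...), the use of floating point numbers breaks common algebraic identities. For example, with IEEE 754 single precision and round-to-nearest, adding 0.1f ten times does not give 1.0f, even though the single product 0.1f * 10 happens to round back to exactly 1.0f:

0.1f + 0.1f + ... + 0.1f (ten terms) = 1.00000012 != 1.0f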

This can be confusing and lead to errors that are hard to detect. In general strict equality should not be used with floating point numbers.
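A common alternative to strict equality is to compare with a tolerance. The sketch below is one possible way to do it, not the only one. The function names and the default relative tolerance of 1e-6 are arbitrary choices for illustration, and the stated result of the sum assumes IEEE 754 single precision evaluated without extra intermediate precision.

```cpp
#include <cmath>

// Adds 0.1f ten times in single precision.
float sum_of_tenths() {
    float sum = 0.0f;
    for (int i = 0; i < 10; ++i) {
        sum += 0.1f;  // each addition rounds to the nearest float
    }
    return sum;
}

// Relative comparison: true if a and b differ by at most rel_tol
// times the larger of their magnitudes.
bool nearly_equal(float a, float b, float rel_tol = 1e-6f) {
    float scale = std::fmax(std::fabs(a), std::fabs(b));
    return std::fabs(a - b) <= rel_tol * scale;
}
```

Here sum_of_tenths() == 1.0f is false, while nearly_equal(sum_of_tenths(), 1.0f) is true. The right tolerance depends on the computation, so treat the default as a placeholder.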

Another example of floating point arithmetic departing from algebraic correctness:

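#include <stdio.h>

int main(void) {
    float x = 16777217.0f;  /* not representable: stored as 16777216.0f */
    float y = 16777215.0f;
    x -= 1.0f;              /* x is now 16777215.0f */
    y += 1.0f;              /* y is now 16777216.0f */
    if (y > x) {printf("16777215.0 + 1.0 > 16777217.0 - 1.0\n");}
    return 0;
}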

Yet another issue is the behaviour of the system when the limits of exact representation are exceeded. When in integer arithmetic the result of an operation is greater than the range of the type, this can be detected in many ways: a special OVERFLOW bit in the processor flags register is set, and the result is significantly different from the expected one.

In floating point arithmetic, as the example above shows, the loss of precision occurs silently.

Hope this helps to understand why one needs many basic numeric types in C.