IEEE 754: How exactly does it work?

Posted 2020-07-18 01:54

Question:

Why does the following code behave as it does in C?

float x = 2147483647; // 2^31 - 1
printf("%f\n", x); // Outputs 2147483648.000000

Here is my thought process:

2147483647 =   0      1001 1101      1111 1111 1111 1111 1111 111

   (0.11111111111111111111111)base2 = (1-(0.5)^23)base10
=> (1.11111111111111111111111)base2 = (1 + 1-(0.5)^23)base10 = (1.99999988)base10

Therefore, to convert the IEEE 754 notation back to decimal: 1.99999988 * 2^30 = 2147483520

So technically, the C program must have printed out 2147483520, right?

Answer 1:

The value to be represented is 2147483647, which a float cannot hold exactly. The two nearest values that can be represented are 2147483520 and 2147483648.

As the latter is closer to the unrepresentable "ideal one" (1 away, versus 127 away for the former), it gets used: in floating point, values get rounded, not truncated.
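
To make the "which neighbor is closer" argument concrete, here is a minimal sketch. The names LOWER_NEIGHBOR, UPPER_NEIGHBOR, distance_to and to_float are made up for illustration, and it assumes float is IEEE 754 binary32 and that int-to-float conversion rounds to nearest (the usual default, but strictly implementation-defined in C).

```c
/* The two floats that bracket 2147483647. Both are exactly representable:
   2147483520 = (2^24 - 1) * 2^7 and 2147483648 = 2^31. */
static const float LOWER_NEIGHBOR = 2147483520.0f;
static const float UPPER_NEIGHBOR = 2147483648.0f;

/* Absolute distance between an int and a float, computed in double,
   which holds both values exactly in this range. */
static double distance_to(int exact, float candidate) {
    double d = (double)candidate - (double)exact;
    return d < 0 ? -d : d;
}

/* The conversion performed by `float x = 2147483647;`.
   Assumption: the implementation rounds to nearest here. */
static float to_float(int n) {
    return (float)n;
}
```

Under those assumptions, distance_to(2147483647, LOWER_NEIGHBOR) is 127 and distance_to(2147483647, UPPER_NEIGHBOR) is 1, so to_float(2147483647) should compare equal to UPPER_NEIGHBOR.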



Answer 2:

The standard is published by IEEE. You might have to purchase it, as IEEE (and other organizations like it) mainly make their money by selling their standards, to defray the costs of assembling them, lobbying for acceptance, and improving their quality.

The bits only mean what someone designates them to be

"When I use a word," Humpty Dumpty said in rather a scornful tone, "it means just what I choose it to mean -- neither more nor less." "The question is," said Alice, "whether you can make words mean so many different things." "The question is," said Humpty Dumpty, "which is to be master -- that's all." (Through the Looking Glass, Chapter 6)

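In this case IEEE has decided what the bits mean, and the reason the printf conversion specifier %f prints the matching human-readable representation is that printf interprets the bits according to the same standard. (A float argument is promoted to double when passed to printf, which preserves its value exactly.)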

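Occasionally you can reinterpret the bits as another data type (like an int) and print out the "other" representation of those bits. A plain cast won't do it, because C converts the value rather than the bits. You have to go around the type system, for example by pointing a pointer of the wrong type at the object's address and dereferencing it. Be aware that this pointer trick violates C's aliasing rules and is undefined behavior; copying the bytes with memcpy is the well-defined way to get the same result.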
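
As a rough sketch of the well-defined route, the helper below (float_bits is a name made up for this example) copies the bytes of a float into a 32-bit unsigned integer. It assumes float is 32 bits wide and uses the IEEE 754 binary32 layout, which is true on most current platforms but not guaranteed by the C standard.

```c
#include <stdint.h>
#include <string.h>

/* Return the raw 32 bits of a float. memcpy avoids the undefined behavior
   of reading a float object through an incompatible pointer type.
   Assumption: sizeof(float) == sizeof(uint32_t). */
static uint32_t float_bits(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}
```

For the x from the question, float_bits(x) should come out as 0x4F000000 on such a platform: sign 0, exponent field 10011110, fraction all zeros.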

Note that while you are doing the math by hand, the actual hardware isn't guaranteed to do the math exactly as you would. Integer math is exact as long as the result fits in the type, but with floating point math, how you round a number makes a big difference in the output. That's not even mentioning the floating point errors which sometimes were burned into systems (thankfully not often).



Answer 3:

Floating point formats are often in a "normalized form" where the most significant bit of the mantissa is always 1. Since it's always 1, you don't need to use up a bit to store it. So when decoding such a number representation, you'll need to add back the 1 at the top.
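
Here is a small sketch of that decoding step for single precision, assuming the IEEE 754 binary32 layout (1 sign bit, 8 exponent bits with a bias of 127, 23 stored fraction bits). The function name decode_normal is made up for this example, and it deliberately ignores zero, subnormals, infinities and NaN.

```c
#include <stdint.h>
#include <string.h>

/* Decode a normalized binary32 value by hand:
   (-1)^sign * 1.fraction * 2^(exponent field - 127).
   Not handled in this sketch: zero, subnormals, infinities, NaN. */
static double decode_normal(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);

    uint32_t sign = bits >> 31;
    int exponent = (int)((bits >> 23) & 0xFF) - 127;  /* remove the bias */
    uint32_t fraction = bits & 0x7FFFFF;              /* 23 stored bits */

    /* Add back the implicit leading 1 as bit 23, then scale by 2^-23
       so the significand lies in [1, 2). */
    double value = (double)(fraction | 0x800000u) / 8388608.0;

    /* Apply the power of two with exact multiplications or divisions. */
    for (int i = 0; i < exponent; i++) value *= 2.0;
    for (int i = 0; i > exponent; i--) value /= 2.0;

    return sign ? -value : value;
}
```

With the bit pattern the question assumed (exponent field 157, fraction all ones), this gives 2147483520, which matches the asker's hand calculation. The hand calculation was fine; the bit pattern it started from is just not what the conversion produces.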



Answer 4:

2147483647 = 2^31 - 1 = +1 * 2^30 * 1.1111 1111 1111 1111 1111 1111 1111 11

When encoding this number in the IEEE 754-1985 single precision format, the significand has to be rounded to fit. For round to nearest, ties to even (the default rounding mode), this means it gets rounded up, because the discarded bits (1111111) amount to more than half a unit in the last kept place.

Before rounding:

exponent = 30, significand = 1.1111 1111 1111 1111 1111 1111 1111 11

After rounding the significand to 23 bits after the binary point:

exponent = 30, significand = 10.0000 0000 0000 0000 0000 000

After normalizing:

exponent = 31, significand = 1.0

Encoded in the single precision format:

0 | 10011110 | 00000000000000000000000
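
The steps above can be sketched in integer arithmetic. This is an illustration, not how any particular compiler or FPU implements the conversion; encode_u32 is a made-up name, and it only covers nonzero unsigned 32-bit inputs with round to nearest, ties to even.

```c
#include <stdint.h>

/* Encode a nonzero 32-bit unsigned integer as binary32 bits using
   round to nearest, ties to even. Hypothetical helper for illustration;
   calling it with n == 0 is not supported. */
static uint32_t encode_u32(uint32_t n) {
    /* Unbiased exponent = position of the leading 1 bit. */
    int exponent = 31;
    while (!(n >> exponent)) exponent--;

    uint32_t sig;  /* 24-bit significand, including the leading 1 */
    if (exponent > 23) {
        int shift = exponent - 23;               /* bits that do not fit */
        uint32_t dropped = n & ((1u << shift) - 1);
        uint32_t half = 1u << (shift - 1);
        sig = n >> shift;                        /* truncated significand */
        /* Round up if above half, or exactly half and the result is odd. */
        if (dropped > half || (dropped == half && (sig & 1))) sig++;
        /* Rounding may carry out to 10.000...; renormalize. */
        if (sig >> 24) { sig >>= 1; exponent++; }
    } else {
        sig = n << (23 - exponent);              /* fits exactly */
    }

    /* Sign bit 0, biased exponent, fraction without the implicit 1. */
    return ((uint32_t)(exponent + 127) << 23) | (sig & 0x7FFFFF);
}
```

For 2147483647 the truncated significand is 24 ones and the dropped bits are 1111111, so it rounds up, carries, and renormalizes to exponent 31. The result is 0x4F000000, which is the bit pattern shown above.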