Adding 32 bit floating point numbers.

I'm learning more than I ever wanted to know about floating-point numbers.

Let's say I needed to add:

1 10000000 00000000000000000000000

1 01111000 11111000000000000000000

2’s complement form.

The first bit is the sign, the next 8 bits are the exponent, and the last 23 bits are the mantissa.

Without doing a conversion to scientific notation, how do I add these two numbers? Can you walk through it step by step?

Any good resources for this stuff? Videos and practice examples would be great.

Tags: floating-point 32-bit floating-point-precision

1 answer

劫难

Reply #2 · 2019-05-11 09:41

You have to scale the numbers so that they have the same exponent. Then you add the mantissa fields and, if necessary, normalise the result.

Oh, yes, and if they're different signs, you just call your subtraction function instead :-)
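For the binary case in the question, the same three steps (align, add, normalise) can be sketched in C. This is only a minimal sketch and not how any particular FPU does it: add_same_sign is a made-up helper that assumes both inputs are normal numbers with the same sign, truncates the bits it shifts out instead of rounding, and ignores overflow to infinity.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: add two IEEE 754 single-precision bit patterns.
   Assumptions: same sign, both normal; shifted-out bits are truncated. */
uint32_t add_same_sign(uint32_t a, uint32_t b)
{
    assert((a >> 31) == (b >> 31));              /* same sign only */

    uint32_t sign = a & 0x80000000u;
    int ea = (int)((a >> 23) & 0xFF);            /* biased exponents */
    int eb = (int)((b >> 23) & 0xFF);
    assert(ea > 0 && ea < 255 && eb > 0 && eb < 255);

    /* Put back the implicit leading 1 to get 24-bit significands. */
    uint32_t ma = (a & 0x007FFFFFu) | 0x00800000u;
    uint32_t mb = (b & 0x007FFFFFu) | 0x00800000u;

    /* Make (ea, ma) the operand with the larger exponent. */
    if (ea < eb) {
        int te = ea; ea = eb; eb = te;
        uint32_t tm = ma; ma = mb; mb = tm;
    }

    /* Step 1: scale the smaller operand to the larger exponent. */
    int shift = ea - eb;
    mb = (shift < 32) ? (mb >> shift) : 0;       /* avoid an oversized shift */

    /* Step 2: add the significands. */
    uint32_t sum = ma + mb;

    /* Step 3: normalise if the add carried out into bit 24. */
    if (sum & 0x01000000u) {
        sum >>= 1;
        ea++;
    }

    /* Drop the implicit 1 again and reassemble the fields. */
    return sign | ((uint32_t)ea << 23) | (sum & 0x007FFFFFu);
}
```

With the question's inputs, 0xC0000000 (1 10000000 000...) and 0xBC7C0000 (1 01111000 11111000...), the smaller operand is shifted right by 128 - 120 = 8 places and the result is 0xC000FC00, which is 1 10000000 00000001111110000000000, or -2.015380859375.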

Let's do an example in decimal since it's easier to understand. Let's further assume they're stored with only eight digits to the right of the decimal (and the numbers are between 0 inclusive and 1 exclusive).

Add the two numbers:

sign  exponent  mantissa  value
   1        42  18453284  + 0.18453284 x 10^42
   1        38  17654321  + 0.17654321 x 10^38

Scaling these numbers to the highest exponent gives something where you can add the mantissa fields:

sign  exponent  mantissa  value
   1        42  18453284  + 0.18453284 x 10^42
   1        42      1765  + 0.00001765 x 10^42
   =        ==  ========
   1        42  18455049  + 0.18455049 x 10^42
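The scaling step in the table above can be sketched like this. The function name and the way the toy decimal format is stored (an eight-digit integer mantissa plus a separate exponent) are just assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper for the toy decimal format: rescale an eight-digit
   mantissa from exponent from_exp up to the larger exponent to_exp.
   Each step is a divide-by-ten, so the lowest digit is simply dropped. */
uint32_t scale_mantissa(uint32_t mantissa, int from_exp, int to_exp)
{
    assert(from_exp <= to_exp);      /* only ever scale towards the larger */
    while (from_exp < to_exp && mantissa != 0) {
        mantissa /= 10;
        from_exp++;
    }
    return mantissa;
}
```

Calling scale_mantissa(17654321, 38, 42) gives 1765, and 18453284 + 1765 is the 18455049 shown in the last row.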

And there you have your number. This also illustrates how accuracy can be lost due to the shifting: the low digits of the smaller number (4321 here) simply fall off the end. For example, IEEE 754 single-precision floats will have:

1e38 + 1e-38 = 1e38

such as with:

#include <stdio.h>

int main (void) {
    float f1 = 1e38f;
    float f2 = 1e-38f;
    float f3 = f1 + f2;   /* f2 is far too small to change f1 */
    float f4 = f1 - f3;   /* so the difference is exactly zero */
    printf ("%.50f\n", f4);
    return 0;
}
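To see why the small value vanishes, it can help to look at how far apart neighbouring floats are at that magnitude. The sketch below is an illustration only: gap_above is a made-up helper, and it assumes a 32-bit IEEE 754 float and a positive, finite input below FLT_MAX.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: distance from x to the next representable float
   above it. Assumes float is IEEE 754 single precision and that x is
   positive, finite and smaller than FLT_MAX. */
float gap_above(float x)
{
    uint32_t bits;
    float next;

    memcpy(&bits, &x, sizeof bits);     /* reinterpret the float's bits */
    bits += 1;                          /* next bit pattern = next float up */
    memcpy(&next, &bits, sizeof next);
    return next - x;
}
```

For 1e38f the gap is 2^103, roughly 1e31, so anything as small as 1e-38 is shifted out completely when the exponents are aligned.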

In terms of what happens with overflow, that's part of the normalisation I mentioned. Let's add 99999.999 to 99999.993. Since they already have the same exponent, no need to scale, so we just add:

sign  exponent  mantissa  value
   1         5  99999999  + 0.99999999 x 10^5
   1         5  99999993  + 0.99999993 x 10^5
   =        ==  ========
   1         5 199999992  ???

You can see here that we have a carry situation so we can't put that carry into the number, being limited to eight digits. What we do then is to shift the number to the right so that we can insert the carry. Since that shift is effectively a divide-by-ten, we have to increment the exponent to counter that.

So:

sign  exponent  mantissa  value
   1         5 199999992  ???

becomes:

sign  exponent  mantissa  value
   1         6  19999999  + 0.19999999 x 10^6

In reality, it's not just a simple right-shift since you need to round to the nearest number. If the digit you're shifting out is five or more, you need to add one to the digit on the left. That's why I chose 99999.993 as the second number. If I had added 99999.999 to itself, I would have ended up with:

sign  exponent  mantissa  value
   1         5 199999998  ???

which, on right shift, would have triggered quite a few carries towards the left:

sign  exponent  mantissa  value
   1         6  20000000  + 0.2 x 10^6
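The normalise-with-rounding step can be sketched as follows, again for the toy eight-digit decimal format. The dec8 type and normalise function are invented for this example, and the rounding rule is the simple "five or more rounds up" one described above.

```c
#include <stdint.h>

/* Hypothetical result type for the toy decimal format. */
typedef struct {
    uint32_t mantissa;   /* at most eight decimal digits after normalising */
    int      exponent;
} dec8;

/* Take the raw sum of two eight-digit mantissas (which may have a ninth
   digit from the carry) and squeeze it back into eight digits. */
dec8 normalise(uint32_t sum, int exponent)
{
    dec8 r = { sum, exponent };

    if (r.mantissa > 99999999u) {            /* carry: nine digits */
        uint32_t dropped = r.mantissa % 10;  /* digit being shifted out */
        r.mantissa /= 10;                    /* shift right by one digit */
        if (dropped >= 5)
            r.mantissa += 1;                 /* round to nearest */
        r.exponent += 1;                     /* counter the divide-by-ten */
    }
    return r;
}
```

normalise(199999992, 5) gives mantissa 19999999 with exponent 6, while normalise(199999998, 5) rounds up to mantissa 20000000 with exponent 6, matching the two cases above.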
