I'm learning more than I ever wanted to know about floating-point numbers.
Let's say I needed to add:
1 10000000 00000000000000000000000
1 01111000 11111000000000000000000
2’s complement form.
The first bit is the sign, the next 8 bits are the exponent, and the last 23 bits are the mantissa.
Without doing a conversion to scientific notation, how do I add these two numbers? Can you walk through it step by step?
Any good resources for this stuff? Videos and practice examples would be great.
You have to scale the numbers so that they have the same exponent. Then you add the mantissa fields and, if necessary, normalise the result.
Oh, yes, and if they're different signs, you just call your subtraction function instead :-)
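Applied to the two bit patterns from the question, those steps might be sketched as follows. This is only a rough illustration: the helper names (`unpack_fields`, `add_same_sign`, `bits_to_float`) are made up for this example, it assumes both inputs are normal numbers with the same sign, and it simply truncates the bits shifted out during alignment, whereas real hardware keeps extra guard bits and rounds.

```python
import struct

FRAC_BITS = 23  # width of the stored mantissa field in single precision

def unpack_fields(bits):
    """Split a 32-bit pattern into (sign, biased exponent, 24-bit significand)."""
    sign = bits >> 31
    exp = (bits >> FRAC_BITS) & 0xFF
    frac = bits & ((1 << FRAC_BITS) - 1)
    return sign, exp, frac | (1 << FRAC_BITS)  # restore the hidden leading 1

def add_same_sign(a_bits, b_bits):
    """Add two normal floats of equal sign, working only on the bit fields."""
    sa, ea, ma = unpack_fields(a_bits)
    sb, eb, mb = unpack_fields(b_bits)
    assert sa == sb, "different signs would need the subtraction path"
    # Step 1: scale the number with the smaller exponent up to the larger one.
    if ea < eb:
        ea, ma, eb, mb = eb, mb, ea, ma
    mb >>= ea - eb  # simplification: shifted-out bits are just dropped
    # Step 2: add the aligned significands.
    total = ma + mb
    exp = ea
    # Step 3: normalise if the sum carried into a 25th bit.
    if total >> (FRAC_BITS + 1):
        total >>= 1
        exp += 1
    frac = total & ((1 << FRAC_BITS) - 1)  # drop the hidden bit again
    return (sa << 31) | (exp << FRAC_BITS) | frac

def bits_to_float(bits):
    """Reinterpret a 32-bit pattern as a float, used here only for checking."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

a = int("1" "10000000" "00000000000000000000000", 2)
b = int("1" "01111000" "11111000000000000000000", 2)
result = add_same_sign(a, b)
print(f"{result:032b}")  # 11000000000000001111110000000000
print(bits_to_float(a), bits_to_float(b), bits_to_float(result))
```

Here the exponents differ by 8, so the second significand is shifted right by eight places before the add, and the sum needs no further normalisation because it still fits in 24 bits.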
Let's do an example in decimal since it's easier to understand. Let's further assume they're stored with only eight digits to the right of the decimal (and the numbers are between 0 inclusive and 1 exclusive).
Add the two numbers:
Scaling these numbers to the highest exponent gives something where you can add the mantissa fields:
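As a rough sketch of that scaling step, here is the same idea in Python with made-up values: 0.12345678 x 10^11 and 0.87654321 x 10^8 are assumptions chosen for illustration, and `add_scaled` is a hypothetical helper. Mantissas are held as eight-digit integers, and the sketch does not handle a carry out of the top digit.

```python
DIGITS = 8  # mantissa digits kept to the right of the decimal point

def add_scaled(m1, e1, m2, e2):
    """Add 0.m1 x 10^e1 and 0.m2 x 10^e2, with m1 and m2 as DIGITS-digit integers."""
    # Make (m1, e1) the operand with the larger exponent.
    if e1 < e2:
        m1, e1, m2, e2 = m2, e2, m1, e1
    shift = e1 - e2
    # Scaling to the higher exponent shifts digits off the right, where they are lost.
    m2_scaled = m2 // 10 ** shift
    return m1 + m2_scaled, e1

mantissa, exponent = add_scaled(12345678, 11, 87654321, 8)
print(f"0.{mantissa:0{DIGITS}d} x 10^{exponent}")  # 0.12433332 x 10^11
```

The smaller number is scaled to 0.00087654 x 10^11, so its last three digits (321) never take part in the sum.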
And there you have your number. This also illustrates how accuracy can be lost due to the shifting. For example, IEEE 754 single-precision floats will have:
such as with:
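One way to see this loss of accuracy for yourself is to round values to single precision with Python's `struct` module. The helper `to_float32` and the values 1e8 and 1.0 are choices made for this sketch; at 1e8 neighbouring single-precision values are 8 apart, so adding 1 cannot change the result.

```python
import struct

def to_float32(x):
    """Round a Python float (double precision) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

big = to_float32(100000000.0)  # 1e8 is exactly representable in single precision
small = to_float32(1.0)

# The small operand is shifted so far right during scaling that it vanishes.
total = to_float32(big + small)
print(total == big)  # True

# With operands of similar magnitude nothing is lost.
print(to_float32(1000.0 + 1.0))  # 1001.0
```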
In terms of what happens with overflow, that's part of the normalisation I mentioned. Let's add 99999.9999 to 99999.9993. Since they already have the same exponent, no need to scale, so we just add:
You can see here that we have a carry situation so we can't put that carry into the number, being limited to eight digits. What we do then is to shift the number to the right so that we can insert the carry. Since that shift is effectively a divide-by-ten, we have to increment the exponent to counter that.
So:
becomes:
In reality, it's not just a simple right-shift since you need to round to the nearest number. If the number you're shifting out is five or more, you need to add one to the digit on the left. That's why I chose 99999.9993 as the second number. If I had added 99999.9999 to itself, I would have ended up with:
which, on right shift, would have triggered quite a few carries towards the left:
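The carry, right shift and rounding can be sketched in Python along these lines. This is an illustration under assumptions: `normalise` and `add_same_exponent` are hypothetical helpers, and the mantissa is given nine digits here so that 99999.9999 fits exactly as 0.999999999 x 10^5.

```python
DIGITS = 9  # assumed mantissa width for this sketch

def normalise(mantissa, exponent, digits=DIGITS):
    """Shift right (with rounding) until the mantissa fits in `digits` digits."""
    limit = 10 ** digits
    while mantissa >= limit:
        mantissa, dropped = divmod(mantissa, 10)  # right shift by one decimal digit
        if dropped >= 5:
            mantissa += 1  # round up; this may ripple carries to the left
        exponent += 1      # counter the divide-by-ten
    return mantissa, exponent

def add_same_exponent(m1, m2, exponent):
    """Add two mantissas that already share an exponent, then normalise."""
    return normalise(m1 + m2, exponent)

# 99999.9999 + 99999.9993: the dropped digit is 2, so no rounding is needed.
print(add_same_exponent(999999999, 999999993, 5))  # (199999999, 6)

# 99999.9999 + 99999.9999: the dropped digit is 8, so the carries ripple left.
print(add_same_exponent(999999999, 999999999, 5))  # (200000000, 6)
```

The loop is there because rounding up can itself overflow the mantissa again, in which case one more shift is needed.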