Some questions about floating points

Posted 2019-08-06 03:15

I'm wondering whether a number that is represented one way in one floating-point format will be represented the same way in a larger format. That is, if a number has a particular representation as a float, will it have the same representation when that float is cast to a double, and still the same when cast to a long double?

I'm asking because I'm writing a BigInteger implementation, and any floating-point number that is passed in gets sent to a function that accepts a long double to convert it. Which leads me to my next question. Obviously floating-point numbers do not always have exact representations, so what should my BigInteger class attempt to represent when given a float? Is it reasonable to try to represent the same number as printed by std::cout << std::fixed << someFloat; even if that is not the same as the number passed in? Is that the most accurate representation I will be able to get? If so, ...

What's the best way to extract that value (in a base that is some power of 10)? At the moment I'm just grabbing it as a string and passing it to my string constructor. This works, but I can't help feeling there's a better way. Taking the remainder when dividing by my base is certainly not accurate with floats.

Finally, I wonder whether there is a floating-point equivalent of uintmax_t, that is, a typename that will always be the largest floating-point type on a system. Or is there no point, because long double will always be the largest (even if it's the same as a double)?

Thanks, T.

3 answers
可以哭但决不认输i
Reply #2 · 2019-08-06 03:42

Yes. Going from IEEE float to double to extended, you will see the bits of the smaller format carried over into the larger format. For example:

single
S EEEEEEEE MMMMMMM.....
double
S EEEEEEEEEEE MMMMM....

6.5 single
0 10000001 101000...
6.5 double
0 10000000001 101000...
13 single
0 10000010 101000...
13 double
0 10000000010 101000...

The mantissa is left-justified and then padded with zeros on the right.

The exponent is right-justified: the msbit is copied to the top, and the bit next to the msbit is sign-extended to fill the extra positions.

Take an exponent of -2, for example. Subtract 1 to get -3. In two's complement, -3 is 0xFD or 0b11111101, but the exponent bits in the format are 0b01111101, with the msbit inverted. For double, an exponent of -2 gives -2 - 1 = -3, or 0b1111...1101, which becomes 0b0111...1101 with the msbit inverted. (exponent bits = twos_complement(exponent - 1) with the msbit inverted.)

As seen above for an exponent of 3: 3 - 1 = 2, which is 0b000...010, and inverting the upper bit gives 0b100...010.

So yes, you can take the bits from single precision and copy them to the proper locations in the double-precision number. I don't have an extended float reference handy, but I'm pretty sure it works the same way.
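The bit copying described above can be sketched in C++ roughly as follows. This is only an illustration, not code from the answer: the names widen_bits and widen are made up, it assumes float is IEEE 754 binary32 and double is binary64, and it handles normal numbers only (no zeros, denormals, infinities or NaNs). It re-biases the exponent (subtract 127, add 1023), which produces the same bits as the msbit trick.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <limits>

static_assert(std::numeric_limits<float>::is_iec559 && sizeof(float) == 4,
              "sketch assumes IEEE 754 binary32 float");
static_assert(std::numeric_limits<double>::is_iec559 && sizeof(double) == 8,
              "sketch assumes IEEE 754 binary64 double");

// Hypothetical helper: build the binary64 bit pattern from a binary32 one.
// Normal numbers only.
std::uint64_t widen_bits(std::uint32_t f) {
    std::uint64_t sign     = f >> 31;           // 1 sign bit
    std::uint64_t exponent = (f >> 23) & 0xFFu; // 8 biased exponent bits
    std::uint64_t mantissa = f & 0x7FFFFFu;     // 23 mantissa bits
    assert(exponent != 0 && exponent != 0xFF);  // reject zero/denormal/inf/NaN
    std::uint64_t wide_exp = exponent + (1023 - 127); // change the bias
    // Mantissa is left-justified in the 52-bit field: pad 29 zeros on the right.
    return (sign << 63) | (wide_exp << 52) | (mantissa << 29);
}

// Hypothetical wrapper: reinterpret the bits, widen them, reinterpret back.
double widen(float x) {
    std::uint32_t in;
    std::memcpy(&in, &x, sizeof in);
    std::uint64_t out = widen_bits(in);
    double d;
    std::memcpy(&d, &out, sizeof d);
    return d;
}
```

For normal inputs, widen(x) should compare equal to static_cast<double>(x), which is a convenient way to check the bit layout against the compiler's own conversion.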

我只想做你的唯一
Reply #3 · 2019-08-06 03:49

If by "same representation" you mean "exactly the same binary representation in memory except for padding", then no. Double-precision has more bits of both exponent and mantissa, and also has a different exponent bias. But I believe that any single-precision value is exactly representable in double-precision (except possibly denormalised values).

I'm not sure what you mean when you say "floating points do not always have exact representations". Certainly, not all decimal floating-point values have exact binary floating-point values (and vice versa), but I'm not sure that's a problem here. So long as your floating-point input has no fractional part, then a suitably large "BigInteger" format should be able to represent it exactly.

Conversion via a base-10 representation is not the way to go. In theory, all you need is a bit-array of length ~1024, initialise it all to zero, and then shift the mantissa bits in by the exponent value. But without knowing more about your implementation, there's not a lot more I can suggest!
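One possible shape for the "shift the mantissa in by the exponent" idea is sketched below. Since nothing is known about the actual BigInteger class, the storage here is an assumption: a std::vector of little-endian 32-bit limbs. The function name to_limbs is hypothetical. It returns only the magnitude of the truncated value, and it assumes double has a 53-bit mantissa.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical helper: magnitude of trunc(x) as little-endian 32-bit limbs.
// An empty vector stands for zero.
std::vector<std::uint32_t> to_limbs(double x) {
    assert(std::isfinite(x));
    int exp = 0;
    double frac = std::frexp(std::fabs(x), &exp); // |x| = frac * 2^exp, frac in [0.5, 1)
    std::vector<std::uint32_t> limbs;
    if (frac == 0.0 || exp <= 0) return limbs;    // |x| < 1 truncates to zero

    // Scale the fraction to an exact 53-bit integer mantissa.
    std::uint64_t mant = static_cast<std::uint64_t>(std::ldexp(frac, 53));
    int shift = exp - 53;                         // |x| = mant * 2^shift
    if (shift < 0) {                              // drop the fractional bits
        mant >>= -shift;
        shift = 0;
    }

    int word = shift / 32;
    int bit  = shift % 32;
    limbs.assign(word + 3, 0);                    // mant << bit spans at most 3 limbs
    limbs[word]     = static_cast<std::uint32_t>(mant << bit);
    limbs[word + 1] = static_cast<std::uint32_t>(mant >> (32 - bit));
    if (bit != 0)
        limbs[word + 2] = static_cast<std::uint32_t>(mant >> (64 - bit));

    while (!limbs.empty() && limbs.back() == 0)   // trim leading zero limbs
        limbs.pop_back();
    return limbs;
}
```

No decimal conversion happens anywhere, so the result is exact for every finite double. The sign would have to be stored separately by the caller.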

【Aperson】
Reply #4 · 2019-08-06 03:55

double includes all values of float; long double includes all values of double. So you're not losing any value information by conversion to long double. However, you're losing information about the original type, which is relevant (see below).
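A quick way to convince yourself of this is a round-trip check like the one below. The function name survives_widening is made up for illustration, and NaN is excluded because it never compares equal to itself.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical check: widen x step by step, then narrow it again.
// Returns true when no value information was lost along the way.
bool survives_widening(float x) {
    assert(!std::isnan(x));      // NaN never compares equal, so exclude it
    double d = x;                // float -> double
    long double ld = d;          // double -> long double
    return static_cast<double>(ld) == d && static_cast<float>(d) == x;
}
```

This should hold for every non-NaN float, including denormals and infinities. It says nothing about the reverse direction: narrowing an arbitrary long double to float can of course lose information.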

In order to follow common C++ semantics, conversion of a floating point value to integer should truncate the value, not round.

The main problem is with large values that are not exact. You can use the frexp function to find the base 2 exponent of the floating point value. You can use std::numeric_limits<T>::digits to check if that's within the integer range that can be exactly represented.

My personal design choice would be an assert that the fp value is within the range that can be exactly represented, i.e. a restriction on the range of any actual argument.

To do that properly you need overloads taking float and double arguments, since the range that can be represented exactly depends on the actual argument's type.
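The range check could look something like this sketch. The name in_exact_integer_range is hypothetical. A template is used here in place of hand-written overloads, and "exactly representable range" is taken to mean |x| < 2^digits, where every integer has an exact representation in T.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical check: true if |x| < 2^digits for the argument's own type T,
// i.e. x lies in the range where every integer is exactly representable.
template <typename T>
bool in_exact_integer_range(T x) {
    if (!std::isfinite(x)) return false;
    int exp = 0;
    std::frexp(x, &exp);  // |x| = m * 2^exp with m in [0.5, 1), so |x| < 2^exp
    return exp <= std::numeric_limits<T>::digits;
}
```

Because T is deduced from the argument, a float is checked against 24 bits and a double against 53, which is the type information that would be lost if everything were funnelled through long double first. A constructor could then do assert(in_exact_integer_range(value)) before converting.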

When you have an fp value that is within the allowed range, you can use floor and fmod to extract digits in any numeral system you want.
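A minimal sketch of the digit extraction, assuming a double argument and a base that fits in 32 bits (10^9 by default). The name digits_of is made up. It is a slight variation on plain floor-and-fmod: after taking the digit with fmod, it subtracts the digit before dividing, so that the division is exact within the asserted range.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical helper: little-endian digits of trunc(|x|) in the given base.
// An empty vector stands for zero.
std::vector<std::uint32_t> digits_of(double x, std::uint32_t base = 1000000000u) {
    assert(base >= 2);
    // Only accept values in the range where every integer is exact in double.
    assert(std::fabs(x) < std::ldexp(1.0, std::numeric_limits<double>::digits));

    double rest = std::floor(std::fabs(x));   // truncate toward zero, keep magnitude
    std::vector<std::uint32_t> digits;
    while (rest > 0.0) {
        double digit = std::fmod(rest, base); // exact remainder
        digits.push_back(static_cast<std::uint32_t>(digit));
        rest = (rest - digit) / base;         // exact: rest - digit is a multiple of base
    }
    return digits;
}
```

The digits come out least significant first, which suits a BigInteger that stores its limbs in little-endian order. The sign has to be handled separately.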
