double type digits in C++

Posted 2019-05-11 16:29

Question:

An IEEE 754 double (64 bits) is supposed to represent 15 significant decimal digits correctly, although the internal representation holds up to 17 digits. Is there a way to force the 16th and 17th digits to zero?

Ref: http://msdn.microsoft.com/en-us/library/system.double(VS.80).aspx : . .

Remember that a floating-point number can only approximate a decimal number, and that the precision of a floating-point number determines how accurately that number approximates a decimal number. By default, a Double value contains 15 decimal digits of precision, although a maximum of 17 digits is maintained internally. The precision of a floating-point number has several consequences: . .

Example numbers:
d1 = 97842111437.390091
d2 = 97842111437.390076

d1 and d2 differ in the 16th and 17th significant digits, which are not supposed to be significant. I am looking for a way to force them to zero, i.e.
d1 = 97842111437.390000
d2 = 97842111437.390000

Answer 1:

No. Counter-example: the two floating-point numbers closest to the rational number

1.11111111111118

(which has 15 decimal digits) are

1.1111111111111799942818834097124636173248291015625
1.1111111111111802163264883347437717020511627197265625

In other words, there is no floating-point number that starts with 1.1111111111111800.
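To reproduce the counter-example yourself, here is a minimal sketch. The helper names `fixed_digits` and `neighbours_of_target` are made up for illustration; it assumes `double` is an IEEE 754 binary64 and that the compiler rounds the literal to the nearest double.

```cpp
#include <cmath>
#include <cstdio>
#include <string>

// Hypothetical helper: format x with `decimals` digits after the point.
std::string fixed_digits(double x, int decimals) {
    char buf[128];
    std::snprintf(buf, sizeof buf, "%.*f", decimals, x);
    return buf;
}

// The double the compiler picks for the literal, and the next double above it.
struct Neighbours {
    double nearest;
    double next_up;
};

Neighbours neighbours_of_target() {
    const double nearest = 1.11111111111118;  // rounded to the closest double
    return {nearest, std::nextafter(nearest, 2.0)};
}
```

Printing both members with something like `fixed_digits(n.nearest, 52)` should show the long expansions quoted above: the digits past the 15th are whatever the binary value happens to produce, not zeros.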



Answer 2:

This question is a little malformed. The hardware stores the numbers in binary, not decimal, so in the general case you can't do exact math in base 10. Some decimal numbers (0.1 is one of them!) have no finite representation in binary at all; their binary expansion repeats forever and has to be rounded. If you have precision requirements like this, where you care about the number being known to exactly 15 decimal digits, you will need to pick another representation for your numbers.
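One possible "other representation" is a scaled integer. The sketch below is only an illustration under the assumption that two decimal places are what matters, as in the question's example; `to_hundredths` is a hypothetical helper, not a standard function.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical helper: store a value as a whole number of hundredths.
// Rounding happens once, here; afterwards all arithmetic is exact
// integer arithmetic (until the int64 range is exhausted).
std::int64_t to_hundredths(double x) {
    return std::llround(x * 100.0);
}

// Adding two scaled values is plain integer addition, with no binary
// rounding error creeping in.
std::int64_t add_hundredths(std::int64_t a, std::int64_t b) {
    return a + b;
}
```

With this, both d1 and d2 from the question map to the same integer, so they compare equal with `==`. A decimal floating-point or arbitrary-precision library would be another route if you need more than a fixed scale.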



Answer 3:

No, but I wonder if this is relevant to any of your issues (GCC specific):

GCC Documentation

-ffloat-store Do not store floating point variables in registers, and inhibit other options that might change whether a floating point value is taken from a register or memory.

This option prevents undesirable excess precision on machines such as the 68000 where the floating registers (of the 68881) keep more precision than a double is supposed to have. Similarly for the x86 architecture. For most programs, the excess precision does only good, but a few programs rely on the precise definition of IEEE floating point. Use -ffloat-store for such programs, after modifying them to store all pertinent intermediate computations into variables.
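If you want to check whether your build is affected by excess precision at all, the standard macro `FLT_EVAL_METHOD` from `<cfloat>` reports how intermediates are evaluated. The sketch below is an assumption-laden illustration: `eval_method_description` and `store_as_double` are hypothetical helpers, and the `volatile` store is a common workaround in the spirit of "store intermediates into variables", not something the GCC manual prescribes.

```cpp
#include <cfloat>
#include <string>

// Hypothetical helper: describe how the compiler evaluates
// floating-point expressions.
std::string eval_method_description() {
    switch (FLT_EVAL_METHOD) {
        case 0:  return "operations evaluated in their own type";
        case 1:  return "float and double evaluated as double";
        case 2:  return "everything evaluated as long double";
        default: return "indeterminate or implementation-defined";
    }
}

// Hypothetical helper: push a value through memory so that any extra
// register precision is rounded away to a plain double.
double store_as_double(double x) {
    volatile double stored = x;
    return stored;
}
```

On typical x86-64 builds that use SSE2, I would expect the first case, in which case -ffloat-store has little to fix.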



Answer 4:

You should be able to directly modify the bits in your number by creating a union with a field for the floating point number and an integral type of the same size. Then you can access the bits you want and set them however you want. Here is an example where I whack the sign bit; you can choose any field you want, of course.

#include <stdio.h>

/* Overlay a double and a 64-bit unsigned integer on the same storage. */
union double_int {
  double             fp;
  unsigned long long integer;
};

int main(void)
{
  /* Write through one member, then read the bits through the other.
     C permits this; C++ formally does not, so use memcpy there. */
  union double_int my_union;
  my_union.fp = 1325.34634;

  /* print original numbers */
  printf("Float   %f\n", my_union.fp);
  printf("Integer %llx\n", my_union.integer);

  /* whack the sign bit to 1 */
  my_union.integer |= 1ULL << 63;

  /* print modified numbers */
  printf("Negative float   %f\n", my_union.fp);
  printf("Negative integer %llx\n", my_union.integer);

  return 0;
}
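For C++, a sketch of the same idea using `std::memcpy` instead of a union might look like the following. `clear_low_mantissa_bits` is a hypothetical helper and assumes a 64-bit IEEE 754 double. Note that it zeroes binary digits of the mantissa, which is not the same thing as zeroing the 16th and 17th decimal digits.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: zero the lowest `n` bits of the 52-bit mantissa.
// This truncates the magnitude toward zero in binary; the decimal
// expansion of the result is generally still full of nonzero digits.
double clear_low_mantissa_bits(double x, unsigned n) {
    static_assert(sizeof(double) == sizeof(std::uint64_t),
                  "assumes a 64-bit double");
    if (n > 52) n = 52;  // never touch the exponent or sign fields

    std::uint64_t bits;
    std::memcpy(&bits, &x, sizeof bits);        // well-defined type pun
    bits &= ~((std::uint64_t{1} << n) - 1);     // mask off the low n bits
    std::memcpy(&x, &bits, sizeof x);
    return x;
}
```

For example, `clear_low_mantissa_bits(d1, 4)` would round d1 down to a multiple of 16 units in the last place, but printing the result with 17 digits will still not end in "00".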


Answer 5:

Generally speaking, people only care about something like this ("I only want the first x digits") when displaying the number. That's relatively easy with stringstreams or sprintf.
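A minimal sketch of the display approach with a string stream, assuming 15 significant digits is the cutoff you want (`to_15_significant_digits` is just an illustrative name):

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: render x with at most 15 significant digits.
// The stored double is untouched; only the text is rounded.
std::string to_15_significant_digits(double x) {
    std::ostringstream out;
    out << std::setprecision(15) << x;
    return out.str();
}
```

Both d1 and d2 from the question come out as the same string this way, because they only differ past the 15th significant digit.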

If you're concerned about comparing numbers with ==, you really can't do that reliably with floating point numbers. Instead you want to see if the numbers are close enough (say, within some small tolerance of each other, scaled to their magnitude).
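A sketch of such a comparison, assuming a relative tolerance is what you want. `nearly_equal` and the tolerance value are illustrative choices, not a standard API; pick the tolerance to match how many digits you actually trust.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical helper: true when a and b agree to within rel_tol,
// measured relative to the larger of the two magnitudes.
bool nearly_equal(double a, double b, double rel_tol) {
    const double diff  = std::fabs(a - b);
    const double scale = std::max(std::fabs(a), std::fabs(b));
    return diff <= rel_tol * scale;
}
```

With `rel_tol = 1e-14` (roughly "agree to 14 digits"), d1 and d2 from the question compare as equal even though `d1 == d2` is false. A purely relative test like this is strict near zero, so add an absolute tolerance if your values can be tiny.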

Playing with the bits of the number directly isn't a great idea.