How to check dependencies of floats

2019-01-18 22:31发布

问题:

I want to determine (in c++) if one float number is the multiplicative inverse of another float number. The problem is that i have to use a third variable to do it. For instance this code:

float x=5,y=0.2;
if(x==(1/y)) cout<<"They are the multiplicative inverse of eachother"<<endl;
else cout<<"They are NOT the multiplicative inverse of eachother"<<endl;

will output: "they are not..." which is wrong and this code:

float x=5,y=0.2,z;
z=1/y;
if(x==z) cout<<"They are the multiplicative inverse of eachother"<<endl;
else cout<<"They are NOT the multiplicative inverse of eachother"<<endl;

will output: "they are..." which is right.
why is this happening?

回答1:

The Float Precision Problem

    You have two problems here, but both come from the same root

You can't compare floats precisely. You can't subtract or divide them precisely. You can't count anything for them precisely. Any operation with them could (and almost always does) bring some error into the result. Even a=0.2f is not a precise operation. The deeper reasons of that are very well explained by the authors of the other answers here. (My thanks and votes to them for that.)

Here comes your first and more simple error. You should never, never, never, never, NEVER use on them == or its equivalent in any language.

Instead of a==b, use Abs(a-b)<HighestPossibleError instead.


    But this is not the sole problem in your task.

Abs(1/y-x)<HighestPossibleError won't work, either. At least, it won't work often enough. Why?

Let's take pair x=1000 and y=0.001. Let's take the "starting" relative error of y for 10-6.

(Relative error = error/value).

Relative errors of values are adding to at multiplication and division.

1/y is about 1000. Its relative error is the same 10-6. ("1" hasn't errors)

That makes absolute error =1000*10-6=0.001. When you subtract x later, that error will be all that remains. (Absolute errors are adding to at adding and subtracting, and the error of x is negligibly small.) Surely, you are not counting on so large errors, HighestPossibleError would be surely set lower and your program would throw off a good pair of x,y

So, the next two rule for float operations: try not to divide greater valuer by lesser one and God save you from subtracting the close values after that.

There are two simple ways to escape this problem.

  • By founding what of x,y has the greater abs value and divide 1 by the greater one and only later to subtract the lesser one.

  • If you want to compare 1/y against x, while you are working yet with letters, not values, and your operations make no errors, multiply the both sides of comparison by y and you have 1 against x*y. (Usually you should check signs in that operation, but here we use abs values, so, it is clean.) The result comparison has no division at all.

In a shorter way:

1/y V x   <=>   y*(1/y) V x*y   <=>   1 V x*y 

We already know that such comparison as 1 against x*y should be done so:

const float HighestPossibleError=1e-10;
if(Abs(x*y-1.0)<HighestPossibleError){...

That is all.


P.S. If you really need it all on one line, use:

if(Abs(x*y-1.0)<1e-10){...

But it is bad style. I wouldn't advise it.

P.P.S. In your second example the compiler optimizes the code so, that it sets z to 5 before running any code. So, checking 5 against 5 works even for floats.



回答2:

The problem is that 0.2 cannot be represented exactly in binary, because its binary expansion has an infinite number of digits:

 1/5: 0.0011001100110011001100110011001100110011...

This is similar to how 1/3 cannot be represented exactly in decimal. Since x is stored in a float which has a finite number of bits, these digits will get cut off at some point, for example:

   x: 0.0011001100110011001100110011001

The problem arises because CPUs often use a higher precision internally, so when you've just calculated 1/y, the result will have more digits, and when you load x to compare them, x will get extended to match the internal precision of the CPU.

 1/y: 0.0011001100110011001100110011001100110011001100110011
   x: 0.0011001100110011001100110011001000000000000000000000

So when you do a direct bit-by-bit comparison, they are different.

In your second example, however, storing the result into a variable means it gets truncated before doing the comparison, so comparing them at this precision, they're equal:

   x: 0.0011001100110011001100110011001
   z: 0.0011001100110011001100110011001

Many compilers have switches you can enable to force intermediate values to be truncated at every step for consistency, however the usual advice is to avoid doing direct comparisons between floating-point values and instead check if they differ by less than some epsilon value, which is what Gangnus is suggesting.



回答3:

You will have to precisely define what it means for two approximations to be multiplicative inverses. Otherwise, you won't know what it is you're supposed to be testing.

0.2 has no exact binary representation. If you store numbers that have no exact representation with limited precision, you won't get answers that are exactly correct.

The same things happens in decimal. For example, 1/3 has no exact decimal representation. You can store it as .333333. But then you have a problem. Are 3 and .333333 multiplicative inverses? If you multiply them, you get .999999. If you want the answer to be "yes" you'll have to create a test for multiplicative inverses that isn't as simple as multiplying and testing for equality to 1.

The same thing happens with binary.



回答4:

The discussions in other replies are great and so I won't repeat any of them, but there's no code. Here's a little bit of code to actually check if a pair of floats gives exactly 1.0 when multiplied.

The code makes a few assumptions/assertions (which are normally met on the x86 platform):
- float's are 32-bit binary (AKA single precision) IEEE-754
- either int's or long's are 32-bit (I decided not to rely on the availability of uint32_t)
- memcpy() copies floats to ints/longs such that 8873283.0f becomes 0x4B076543 (i.e. certain "endianness" is expected)

One extra assumption is this:
- it receives the actual floats that * would multiply (i.e. multiplication of floats wouldn't use higher precision values that the math hardware/library can use internally)

#include <stdio.h>
#include <string.h>
#include <limits.h>
#include <assert.h>

#define C_ASSERT(expr) extern char CAssertExtern[(expr)?1:-1]

#if UINT_MAX >= 0xFFFFFFFF
typedef unsigned int uint32;
#else
typedef unsigned long uint32;
#endif
typedef unsigned long long uint64;

C_ASSERT(CHAR_BIT == 8);
C_ASSERT(sizeof(uint32) == 4);
C_ASSERT(sizeof(float) == 4);

int ProductIsOne(float f1, float f2)
{
  uint32 m1, m2;
  int e1, e2, s1, s2;
  int e;
  uint64 m;

  // Make sure floats are 32-bit IEE754 and
  // reinterpreted as integers as we expect
  {
    static const float testf = 8873283.0f;
    uint32 testi;
    memcpy(&testi, &testf, sizeof(testf));
    assert(testi == 0x4B076543);
  }

  memcpy(&m1, &f1, sizeof(f1));
  s1 = m1 >= 0x80000000;
  m1 &= 0x7FFFFFFF;
  e1 = m1 >> 23;
  m1 &= 0x7FFFFF;
  if (e1 > 0) m1 |= 0x800000;

  memcpy(&m2, &f2, sizeof(f2));
  s2 = m2 >= 0x80000000;
  m2 &= 0x7FFFFFFF;
  e2 = m2 >> 23;
  m2 &= 0x7FFFFF;
  if (e2 > 0) m2 |= 0x800000;

  if (e1 == 0xFF || e2 == 0xFF || s1 != s2) // Inf, NaN, different signs
    return 0;

  m = (uint64)m1 * m2;

  if (!m || (m & (m - 1))) // not a power of 2
    return 0;

  e = e1 + !e1 - 0x7F - 23 + e2 + !e2 - 0x7F - 23;
  while (m > 1) m >>= 1, e++;

  return e == 0;
}

const float testData[][2] =
{
  { .1f, 10.0f },
  { 0.5f, 2.0f },
  { 0.25f, 2.0f },
  { 4.0f, 0.25f },
  { 0.33333333f, 3.0f },
  { 0.00000762939453125f, 131072.0f }, // 2^-17 * 2^17
  { 1.26765060022822940E30f, 7.88860905221011805E-31f }, // 2^100 * 2^-100
  { 5.87747175411143754E-39f, 1.70141183460469232E38f }, // 2^-127 (denormalized) * 2^127
};

int main(void)
{
  int i;
  for (i = 0; i < sizeof(testData) / sizeof(testData[0]); i++)
    printf("%g * %g %c= 1\n",
           testData[i][0], testData[i][1],
           "!="[ProductIsOne(testData[i][0], testData[i][1])]);
  return 0;
}

Output (see at ideone.com):

0.1 * 10 != 1
0.5 * 2 == 1
0.25 * 2 != 1
4 * 0.25 == 1
0.333333 * 3 != 1
7.62939e-06 * 131072 == 1
1.26765e+30 * 7.88861e-31 == 1
5.87747e-39 * 1.70141e+38 == 1


回答5:

What is striking is that whatever the rounding rule is, you expect the outcome of the two versions to be the same (either twice wrong or twice right)!

Most probably, in the first case a promotion to higher accuracy in the FPU registers takes place when evaluating x==1/y, whereas z= 1/y really stores the single-precision result.

Other contributors have explaine why 5==1/0.2 can fail, I needn't repeat that.