Comparing uint64_t and float for numeric equivalen

2020-04-11 16:22发布

问题:

I am writing a protocol, that uses RFC 7049 as its binary representation. The standard states, that the protocol may use 32-bit floating point representation of numbers, if their numeric value is equivalent to respective 64-bit numbers. The conversion must not lead to lose of precision.

  • What 32-bit float numbers can be bigger than 64-bit integer and numerically equivalent with them?
  • Is comparing float x; uint64_t y; (float)x == (float)y enough for ensuring, that the values are equivalent? Will this comparison ever be true?

RFC 7049 §3.6. Numbers

For the purposes of this specification, all number representations for the same numeric value are equivalent. This means that an encoder can encode a floating-point value of 0.0 as the integer 0. It, however, also means that an application that expects to find integer values only might find floating-point values if the encoder decides these are desirable, such as when the floating-point value is more compact than a 64-bit integer.

回答1:

There certainly are numbers for which this is true:

2^33 can be perfectly represented as a floating point number, but clearly cannot be represented as a 32-bit integer. The following code should work as expected:

bool representable_as_float(int64_t value) {
    float repr = value;
    return repr >= -0x1.0p63 && repr < 0x1.0p63 && (int64_t)repr == value;
}

It is important to notice though that we are basically doing (int64_t)(float)value and not the other way around - we are interested if the cast to float loses any precision.

The check to see whether repr is smaller than the maximum value of int64_t is important since we could invoke undefined behavior otherwise, since the cast to float may round up to the next higher number (which could then be larger than the maximum value possible in int64_t). (Thanks to @tmyklebu for pointing this out).

Two samples:

// powers of 2 can easily be represented
assert(representable_as_float(((int64_t)1) << 33));
// Other numbers not so much:
assert(!representable_as_float(std::numeric_limits<int64_t>::max())); 


回答2:

The following is based on Julia's method for comparing floats and integers. This does not require access to 80-bit long doubles or floating point exceptions, and should work under any rounding mode. I believe this should work for any C float type (IEEE754 or not), and not cause any undefined behaviour.

UPDATE: technically this assumes a binary float format, and that the float exponent size is large enough to represent 264: this is certainly true for the standard IEEE754 binary32 (which you refer to in your question), but not, say, binary16.

#include <stdio.h>
#include <stdint.h>

int cmp_flt_uint64(float x,uint64_t y) {
  return (x == (float)y) && (x != 0x1p64f) && ((uint64_t)x == y);
}

int main() {
  float x = 0x1p64f;
  uint64_t y = 0xffffffffffffffff;

  if (cmp_flt_uint64(x,y))
    printf("true\n");
  else 
    printf("false\n");
  ;
}

The logic here is as follows:

  • The first equality can be true only if x is a non-negative integer in the interval [0,264].
  • The second checks that x (and hence (float)y) is not 264: if this is the case, then y cannot be represented exactly by a float, and so the comparison is false.
  • Any remaining values of x can be exactly converted to a uint64_t, and so we cast and compare.


回答3:

No, you need to compare (long double)x == (long double)y on an architecture where the mantissa of a long double can hold 63 bits. This is because some big long long ints will lose precision when you convert them to float, and compare as equal to a non-equivalent float, but if you convert to long double, it will not lose precision on that architecture.

The following program demonstrates this behavior when compiled with gcc -std=c99 -mssse3 -mfpmath=sse on x86, because these settings use wide-enough long doubles but prevent the implicit use of higher-precision types in calculations:

#include <assert.h>
#include <stdint.h>

const int64_t x = (1ULL<<62) - 1ULL;
const float y = (float)(1ULL<<62);
// The mantissa is not wide enough to store
// 63 bits of precision.

int main(void)
{
  assert ((float)x == (float)y);
  assert ((long double)x != (long double)y);

  return 0;
}

Edit: If you don’t have wide enough long doubles, the following might work:

feclearexcept(FE_ALL_EXCEPT);
x == y;
ftestexcept(FE_INEXACT);

I think, although I could be mistaken, that an implementation could round off x during the conversion in a way that loses precision.

Another strategy that could work is to compare

extern uint64_t x;
extern float y;
const float z = (float)x;

y == z && (uint64_t)z == x;

This should catch losses of precision due to round-off error, but it could conceivably cause undefined behavior if the conversion to z rounds up. It will work if the conversion is set to round toward zero when converting x to z.