I am writing a protocol that uses RFC 7049 as its binary representation. The standard states that the protocol may use the 32-bit floating-point representation of a number if its numeric value is equivalent to the respective 64-bit number. The conversion must not lead to a loss of precision.
- Which 32-bit float values can be bigger than a 64-bit integer and still be numerically equivalent to one?
- Is comparing
float x; uint64_t y; (float)x == (float)y
enough to ensure that the values are equivalent? Will this comparison ever be true?
RFC 7049 §3.6. Numbers
For the purposes of this specification, all number representations
for the same numeric value are equivalent. This means that an
encoder can encode a floating-point value of 0.0 as the integer 0.
It, however, also means that an application that expects to find
integer values only might find floating-point values if the encoder
decides these are desirable, such as when the floating-point value is
more compact than a 64-bit integer.
There certainly are numbers for which this is true:
2^33 can be perfectly represented as a floating point number, but clearly cannot be represented as a 32-bit integer. The following code should work as expected:
bool representable_as_float(int64_t value) {
float repr = value;
return repr >= -0x1.0p63 && repr < 0x1.0p63 && (int64_t)repr == value;
}
It is important to notice, though, that we are basically doing (int64_t)(float)value and not the other way around: we are interested in whether the cast to float loses any precision.
The check that repr is below 2^63 is important: the cast to float may round up to the next representable number, which can be larger than the maximum value of int64_t, and casting such a value back to int64_t would invoke undefined behavior. (Thanks to @tmyklebu for pointing this out.)
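To see the round-up concretely, here is a minimal sketch. The helper name `rounds_out_of_int64_range` is made up for illustration, and the code assumes an IEEE 754 binary32 `float` with the default round-to-nearest mode. Just below 2^63 adjacent floats are 2^39 apart, so INT64_MAX itself converts to exactly 2^63:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if (float)v lands on or above 2^63, i.e. outside
   the range of int64_t, so that casting it back would be undefined behavior.
   Assumes IEEE 754 binary32 float and round-to-nearest. */
static bool rounds_out_of_int64_range(int64_t v) {
    float repr = (float)v;      /* may round to a neighbouring float */
    return repr >= 0x1.0p63f;   /* 2^63 is exactly representable as a float */
}
```

Under those assumptions `rounds_out_of_int64_range(INT64_MAX)` is true, while `rounds_out_of_int64_range(INT64_MAX - (INT64_C(1) << 39))` is false, because that value converts to 2^63 - 2^39, the largest float below 2^63.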
Two samples:
// powers of 2 can easily be represented
assert(representable_as_float(((int64_t)1) << 33));
// Other numbers not so much:
assert(!representable_as_float(std::numeric_limits<int64_t>::max()));
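The same question can also be answered without any floating-point conversion, which sidesteps rounding and range issues altogether. The following is only a sketch: `fits_float_exactly` is a hypothetical name, and it assumes a binary `float` (FLT_RADIX == 2) with fewer than 64 significand bits and an exponent range that covers the value. An integer converts exactly when the bits from its highest to its lowest set bit fit in the FLT_MANT_DIG-bit significand (24 bits for IEEE 754 binary32):

```c
#include <float.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if v converts to float without rounding.
   Assumes FLT_RADIX == 2, FLT_MANT_DIG < 64, and a float exponent range
   large enough for 2^63. */
static bool fits_float_exactly(uint64_t v) {
    if (v == 0)
        return true;
    while ((v & 1u) == 0)
        v >>= 1;                               /* strip trailing zero bits */
    return v < (UINT64_C(1) << FLT_MANT_DIG);  /* rest must fit the significand */
}
```

For example, `fits_float_exactly(UINT64_C(1) << 33)` holds, while `fits_float_exactly((UINT64_C(1) << 24) + 1)` does not on a binary32 platform, since that value needs 25 significant bits.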
The following is based on Julia's method for comparing floats and integers. This does not require access to 80-bit long doubles or floating point exceptions, and should work under any rounding mode. I believe this should work for any C float type (IEEE754 or not), and not cause any undefined behaviour.
UPDATE: technically this assumes a binary float format, and that the float exponent size is large enough to represent 2^64: this is certainly true for the standard IEEE754 binary32 (which you refer to in your question), but not, say, binary16.
#include <stdio.h>
#include <stdint.h>
int cmp_flt_uint64(float x,uint64_t y) {
return (x == (float)y) && (x != 0x1p64f) && ((uint64_t)x == y);
}
int main() {
float x = 0x1p64f;
uint64_t y = 0xffffffffffffffff;
if (cmp_flt_uint64(x,y))
printf("true\n");
else
printf("false\n");
return 0;
}
The logic here is as follows:
- The first equality can be true only if x is a non-negative integer in the interval [0, 2^64].
- The second checks that x (and hence (float)y) is not 2^64: if this is the case, then y cannot be represented exactly by a float, and so the comparison is false.
- Any remaining values of x can be exactly converted to a uint64_t, and so we cast and compare.
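The three steps can be exercised one by one. In the sketch below, `cmp_flt_uint64` is repeated from the program above so the block compiles on its own, and `cmp_flt_uint64_selfcheck` is a hypothetical helper added purely for illustration. It assumes an IEEE 754 binary32 `float`:

```c
#include <stdint.h>

/* Same comparison as above, repeated so this sketch is self-contained. */
static int cmp_flt_uint64(float x, uint64_t y) {
    return (x == (float)y) && (x != 0x1p64f) && ((uint64_t)x == y);
}

/* Hypothetical self-check: returns 1 if every case behaves as the
   reasoning above predicts (assuming IEEE 754 binary32 float). */
static int cmp_flt_uint64_selfcheck(void) {
    return  cmp_flt_uint64(0.0f, 0)                           /* exact match */
        &&  cmp_flt_uint64(0x1p33f, UINT64_C(1) << 33)        /* exact match */
        && !cmp_flt_uint64(-1.0f, 0)                          /* step 1: negative x */
        && !cmp_flt_uint64(0.5f, 0)                           /* step 1: non-integer x */
        && !cmp_flt_uint64(0x1p64f, UINT64_MAX)               /* step 2: x is 2^64 */
        && !cmp_flt_uint64(0x1p24f, (UINT64_C(1) << 24) + 1); /* step 3: y was rounded */
}
```

Because `&&` short-circuits, the cast `(uint64_t)x` is only reached for values of x that have already passed the first two tests, which is what keeps it free of undefined behaviour.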
No, you need to compare (long double)x == (long double)y on an architecture where the mantissa of a long double can hold 63 bits. This is because some big long long ints will lose precision when you convert them to float, and compare as equal to a non-equivalent float, but if you convert to long double, it will not lose precision on that architecture.
The following program demonstrates this behavior when compiled with gcc -std=c99 -mssse3 -mfpmath=sse on x86, because these settings use wide-enough long doubles but prevent the implicit use of higher-precision types in calculations:
#include <assert.h>
#include <stdint.h>
const int64_t x = (1ULL<<62) - 1ULL;
const float y = (float)(1ULL<<62);
// The mantissa is not wide enough to store
// 63 bits of precision.
int main(void)
{
assert ((float)x == (float)y);
assert ((long double)x != (long double)y);
return 0;
}
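Whether long double is wide enough can be checked through the macros in `<float.h>`. The following is a hedged sketch: the name `cmp_via_long_double` and the -1 "cannot decide" return value are illustrative assumptions, not part of the program above. It asks for 64 significand bits so that every uint64_t converts exactly; x87 extended precision provides exactly 64, whereas a long double that is the same as double provides only 53:

```c
#include <float.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if x and y are numerically equal, 0 if they
   are not, and -1 if long double is too narrow to hold every uint64_t. */
static int cmp_via_long_double(float x, uint64_t y) {
    if (FLT_RADIX != 2 || LDBL_MANT_DIG < 64)
        return -1;   /* conversion of y could round; result would be unreliable */
    /* Both conversions are exact here, so the comparison is exact too. */
    return (long double)x == (long double)y;
}
```

A caller that receives -1 would need a different strategy on that platform.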
Edit: If you don’t have wide enough long doubles, the following might work:
feclearexcept(FE_ALL_EXCEPT);
x == y;
fetestexcept(FE_INEXACT);
I think, although I could be mistaken, that an implementation could round off x during the conversion in a way that loses precision.
Another strategy that could work is to compare
extern uint64_t x;
extern float y;
const float z = (float)x;
y == z && (uint64_t)z == x;
This should catch losses of precision due to round-off error, but it could conceivably cause undefined behavior if the conversion to z rounds up. It will work if the conversion is set to round toward zero when converting x to z.
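One way to remove the undefined behavior without touching the rounding mode is to reject the single out-of-range result before casting back. This is a sketch under the assumption of a binary float whose exponent range covers 2^64 (true for IEEE 754 binary32); the function name is made up here:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true only if y holds exactly the integer value x. */
static bool float_equals_uint64(float y, uint64_t x) {
    const float z = (float)x;   /* may round; the largest possible result is 2^64 */
    if (y != z)
        return false;           /* also rejects NaN, negative values and fractions */
    if (z >= 0x1p64f)
        return false;           /* x was rounded up past UINT64_MAX: not exact */
    return (uint64_t)z == x;    /* z is in range now, so the cast is defined */
}
```

With this guard, `float_equals_uint64(0x1p64f, UINT64_MAX)` returns false instead of reaching the out-of-range cast.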