I am writing a protocol that uses RFC 7049 (CBOR) as its binary representation. The standard states that the protocol may use a 32-bit floating-point representation of a number if its numeric value is equivalent to that of the respective 64-bit number; the conversion must not lead to loss of precision.
- Which 32-bit float numbers can be bigger than a 64-bit integer and still be numerically equivalent to it?
- Is comparing

  ```c
  float x; uint64_t y; (float)x == (float)y
  ```

  enough to ensure that the values are equivalent? Will this comparison ever be true?
> For the purposes of this specification, all number representations for the same numeric value are equivalent. This means that an encoder can encode a floating-point value of 0.0 as the integer 0. It, however, also means that an application that expects to find integer values only might find floating-point values if the encoder decides these are desirable, such as when the floating-point value is more compact than a 64-bit integer.
The following is based on Julia's method for comparing floats and integers. It does not require access to 80-bit `long double`s or floating-point exceptions, and should work under any rounding mode. I believe this should work for any C `float` type (IEEE 754 or not) and not cause any undefined behaviour.

UPDATE: technically this assumes a binary `float` format, and that the `float` exponent range is large enough to represent 2^64: this is certainly true of the standard IEEE 754 binary32 (which you refer to in your question), but not of, say, binary16.

The logic here is as follows:

- If `x == (float)y` holds, then `x` is a non-negative integer in the interval [0, 2^64].
- Check that `x` (and hence `(float)y`) is not 2^64: if it is, then `y` cannot be represented exactly by a `float`, and so the comparison is false.
- Now `x` can be exactly converted to a `uint64_t`, so we cast and compare.

No, you need to compare `(long double)x == (long double)y` on an architecture where the mantissa of a `long double` can hold 63 bits. This is because some big `long long` ints lose precision when you convert them to `float`, and then compare equal to a non-equivalent `float`; if you convert to `long double` instead, no precision is lost on such an architecture. The following program demonstrates this behavior when compiled with `gcc -std=c99 -mssse3 -mfpmath=sse` on x86, because these settings use wide-enough `long double`s but prevent the implicit use of higher-precision types in calculations.

Edit: If you don't have wide enough `long double`s, the following might work:
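A round-trip check along these lines (a sketch of my own, using the question's `float x; uint64_t y` naming, not the answer's original code) would be:

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch: check a float x and a uint64_t y for numeric equivalence
   by round-tripping y through float. Assumes IEEE 754 binary32. */
bool equivalent(float x, uint64_t y) {
    float cast = (float)y;       /* may round */
    if (x != cast)
        return false;            /* definitely different values */
    if (cast >= 0x1p64f || cast < 0.0f)
        return false;            /* casting back would be undefined */
    return (uint64_t)cast == y;  /* true only if the conversion was exact */
}
```

Note the `0x1p64f` guard: casting a float that is out of `uint64_t` range back to `uint64_t` is undefined behavior, so it must be excluded before the final comparison.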
I think, although I could be mistaken, that an implementation could round off x during the conversion in a way that loses precision.
Another strategy that could work is to compare
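The answer's code block is missing here; judging from the surrounding description, the comparison presumably looked something like this (my guess at the shape, with `z` being the integer converted to `float`):

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch only: convert the integer x to a float z and round-trip it
   back. NOTE: (uint64_t)z is undefined behaviour if the conversion
   to z rounded up to 2^64; it is safe under round-toward-zero. */
bool equal_via_z(uint64_t x, float f) {
    float z = (float)x;          /* rounding happens here */
    return f == z && (uint64_t)z == x;
}
```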
This should catch losses of precision due to round-off error, but it could conceivably cause undefined behavior if the conversion to z rounds up. It will work if the conversion is set to round toward zero when converting x to z.
There certainly are numbers for which this is true:
2^33 can be perfectly represented as a floating-point number but clearly cannot be represented as a 32-bit integer. The following code should work as expected:
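The answer's code block is missing; a sketch consistent with the surrounding description (the names `value` and `repr` come from the text below; the function name is my own) might look like:

```c
#include <stdbool.h>
#include <stdint.h>

bool representable_as_float(int64_t value) {
    float repr = (float)value;
    /* Guard before casting back: if repr rounded up to 2^63 (or fell
       below -2^63), the cast to int64_t would be undefined behavior. */
    if (repr >= 0x1p63f || repr < -0x1p63f)
        return false;
    return (int64_t)repr == value;   /* true only for an exact round trip */
}
```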
It is important to notice, though, that we are basically doing `(int64_t)(float)value` and not the other way around: we are interested in whether the cast to `float` loses any precision.
The check that `repr` is smaller than the maximum value of `int64_t` is important, since otherwise we could invoke undefined behavior: the cast to `float` may round up to the next higher number, which could then be larger than the maximum value representable in `int64_t`. (Thanks to @tmyklebu for pointing this out.)
Two samples: