I have one double, and one int64_t. I want to know if they hold exactly the same value, and if converting one type into the other does not lose any information.
My current implementation is the following:
int int64EqualsDouble(int64_t i, double d) {
return (d >= INT64_MIN)
&& (d < INT64_MAX)
&& (round(d) == d)
&& (i == (int64_t)d);
}
My question is: is this implementation correct? And if not, what would be a correct answer? To be correct, it must leave no false positive, and no false negative.
Some sample inputs:
- int64EqualsDouble(0, 0.0) should return 1
- int64EqualsDouble(1, 1.0) should return 1
- int64EqualsDouble(0x3FFFFFFFFFFFFFFF, (double)0x3FFFFFFFFFFFFFFF) should return 0, because 2^62 - 1 can be exactly represented with int64_t, but not with double.
- int64EqualsDouble(0x4000000000000000, (double)0x4000000000000000) should return 1, because 2^62 can be exactly represented in both int64_t and double.
- int64EqualsDouble(INT64_MAX, (double)INT64_MAX) should return 0, because INT64_MAX can not be exactly represented as a double
- int64EqualsDouble(..., 1.0e100) should return 0, because 1.0e100 can not be exactly represented as an int64_t.
Yes, your solution works correctly, because int64_t is represented in two's complement by definition (C99 7.18.1.1:1), on platforms that use something resembling binary IEEE 754 double-precision for the double type. It is basically the same as this one.
Under these conditions, d < INT64_MAX is correct because it is equivalent to d < (double) INT64_MAX, and in the conversion to double, the number INT64_MAX, equal to 0x7fffffffffffffff, rounds up to 2^63. Thus you want d to be strictly less than the resulting double to avoid triggering UB when executing (int64_t)d.
On the other hand, INT64_MIN, being -0x8000000000000000, is exactly representable, meaning that a double that is equal to (double)INT64_MIN can be equal to some int64_t and should not be excluded (and such a double can be converted to int64_t without triggering undefined behavior).
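The two boundary facts above can be checked directly. This is a minimal sketch, assuming double is IEEE 754 binary64 and the default round-to-nearest mode is in effect; the helper names are made up for illustration.

```c
#include <stdint.h>

/* Hypothetical helper: 1 if converting INT64_MAX to double rounds up to 2^63.
   Assumes binary64 double and round-to-nearest. */
int int64_max_rounds_up(void) {
    return (double)INT64_MAX == 0x1.0p63;
}

/* Hypothetical helper: 1 if INT64_MIN converts to double exactly (-2^63). */
int int64_min_is_exact(void) {
    return (double)INT64_MIN == -0x1.0p63;
}

/* Hypothetical helper: the largest binary64 value strictly below 2^63 is
   2^63 - 1024. It passes d < INT64_MAX and converts back without overflow. */
int64_t largest_safe_double_as_int64(void) {
    double d = 0x1.fffffffffffffp62; /* 2^63 - 1024 */
    return (int64_t)d;
}
```

The third helper illustrates why the strict comparison is enough: every double below 2^63 is at most 2^63 - 1024, which fits in int64_t.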
It goes without saying that since we have specifically used the assumptions about 2's complement for integers and binary floating-point, the correctness of the code is not guaranteed by this reasoning on platforms that differ. Take a platform with binary 64-bit floating-point and a 64-bit 1's complement integer type T. On that platform T_MIN is -0x7fffffffffffffff. The conversion to double of that number rounds down, resulting in -0x1.0p63. On that platform, using your program as it is written, using -0x1.0p63 for d makes the first three conditions true, resulting in undefined behavior in (T)d, because overflow in the conversion from floating-point to integer is undefined behavior.
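A 1's complement machine is hard to come by, but the rounding step in that argument can be reproduced with int64_t, since -INT64_MAX has the same value as the hypothetical T_MIN. This is a sketch, assuming binary64 double and round-to-nearest; the function name is illustrative only.

```c
#include <stdint.h>

/* -INT64_MAX is -0x7fffffffffffffff, the value T_MIN would have on the
   hypothetical 1's complement platform. It needs 63 significant bits, so
   converting it to binary64 rounds it to -2^63 (assuming round-to-nearest). */
int ones_complement_min_rounds_to_minus_2_63(void) {
    return (double)(-INT64_MAX) == -0x1.0p63;
}
```

On such a platform -2^63 would lie outside the range of T, which is why the final cast would be undefined.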
If you have access to full IEEE 754 features, there is a shorter solution:
#include <fenv.h>
…
#pragma STDC FENV_ACCESS ON
feclearexcept(FE_INEXACT), f == i && !fetestexcept(FE_INEXACT)
This solution takes advantage of the conversion from integer to floating-point setting the INEXACT flag iff the conversion is inexact (that is, if i is not representable exactly as a double). The INEXACT flag remains unset and f is equal to (double)i if and only if f and i represent the same mathematical value in their respective types.
This approach requires the compiler to have been warned that the code accesses the FPU's state, normally with #pragma STDC FENV_ACCESS ON, but that's typically not supported and you have to use a compilation flag instead.
OP's code has a dependency that can be avoided.
For a successful compare, d must be a whole number, and round(d) == d takes care of that. Even a NaN d fails that test.
d must be mathematically in the range [INT64_MIN ... INT64_MAX], and if the range conditions properly ensure that, then the final i == (int64_t)d completes the test.
So the question comes down to comparing the int64_t limits with the double d.
Let us assume FLT_RADIX == 2, but not necessarily IEEE 754 binary64.
d >= INT64_MIN is not a problem, as INT64_MIN is a negated power of 2 and converts exactly to a double of the same value, so the >= is exact.
Code would like to do the mathematical d <= INT64_MAX, but that may not work. INT64_MAX is a "power of 2 - 1" and may not convert exactly (it depends on whether the precision of the double is at least 63 bits), which makes the result of the compare unclear. A solution is to halve the comparison: d/2 suffers no precision loss, and INT64_MAX/2 + 1 converts exactly to a double power of 2:
d/2 < (INT64_MAX/2 + 1)
[Edit]
// or simply
d < ((double)(INT64_MAX/2 + 1))*2
Thus, if code does not want to rely on double having less precision than int64_t (a concern that likely applies to long double), a more portable solution would be
int int64EqualsDouble(int64_t i, double d) {
return (d >= INT64_MIN)
&& (d < ((double)(INT64_MAX/2 + 1))*2) // (d/2 < (INT64_MAX/2 + 1))
&& (round(d) == d)
&& (i == (int64_t)d);
}
Note: No rounding mode issues.
[Edit] Deeper limit explanation
Ensuring that, mathematically, INT64_MIN <= d <= INT64_MAX can be restated as INT64_MIN <= d < (INT64_MAX + 1), as we are dealing with whole numbers. Since the raw application of (double) (INT64_MAX + 1) in code overflows int64_t (undefined behavior) before the conversion even happens, an alternative is ((double)(INT64_MAX/2 + 1))*2. This can be extended for rare machines whose FLT_RADIX is a higher power of 2 to ((double)(INT64_MAX/FLT_RADIX + 1))*FLT_RADIX. The comparison limits being exact powers of 2, conversion to double suffers no precision loss and (lo_limit <= d) && (d < hi_limit) is exact, regardless of the precision of the floating point. Note that a rare floating point with FLT_RADIX == 10 is still a problem.
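The limit construction can be isolated into a small range check. This is a sketch assuming FLT_RADIX == 2; the function name is hypothetical, while lo_limit and hi_limit follow the names used in the explanation above.

```c
#include <stdint.h>

/* Hypothetical helper: range check whose limits are exact powers of 2,
   so neither conversion to double loses precision. Assumes FLT_RADIX == 2. */
int fitsInt64Range(double d) {
    const double lo_limit = (double)INT64_MIN;               /* -2^63, exact */
    const double hi_limit = (double)(INT64_MAX / 2 + 1) * 2; /*  2^63, exact */
    return (lo_limit <= d) && (d < hi_limit);                /* a NaN fails the comparison */
}
```

Only after this check passes is it safe to evaluate (int64_t)d, which is why the range conditions come first in the full function.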
In addition to Pascal Cuoq's elaborate answer, and given the extra context you give in comments, I would add a test for negative zeros. You should preserve negative zeros unless you have good reasons not to. You need a specific test to avoid converting them to (int64_t)0. With your current proposal, negative zeros will pass your test, get stored as int64_t, and be read back as positive zeros.
I am not sure what the most efficient way to test for them is; maybe this:
int int64EqualsDouble(int64_t i, double d) {
return (d >= INT64_MIN)
&& (d < INT64_MAX)
&& (round(d) == d)
&& (i == (int64_t)d)
&& (!signbit(d) || d != 0.0);
}
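The sign loss that motivates the extra condition can be observed with a short round trip. This is a sketch with a hypothetical helper name; it is only meant for values already known to be in the int64_t range.

```c
#include <math.h>
#include <stdint.h>

/* Hypothetical helper: 1 if the sign bit of d is the same after a round trip
   through int64_t, 0 otherwise. The caller must ensure d is in range. */
int signSurvivesRoundTrip(double d) {
    double back = (double)(int64_t)d; /* -0.0 becomes 0, then +0.0 */
    return !signbit(d) == !signbit(back);
}
```

Negative zero is the only in-range whole number for which this returns 0, which is exactly the case that the signbit condition filters out.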