I'm looking for a reasonably efficient way of determining if a floating point value (double) can be exactly represented by an integer data type (long, 64 bit).
My initial thought was to check the exponent to see if it was 0 (or more precisely 1023, the biased value for a double). But that won't work because 2.0 would be e=1 m=1...
So basically, I am stuck. I have a feeling that I can do this with bit masks, but I'm just not getting my head around how to do that at this point.
So how can I check to see if a double is exactly representable as a long?
Thanks
Any IEEE floating-point double or float value with a magnitude at or above 2^52 or 2^23, respectively, will be a whole number. Adding 2^52 or 2^23 to a positive number whose magnitude is less than that will cause it to be rounded to a whole number. Subtracting the value that was added will yield a whole number which will equal the original iff the original was a whole number. Note that this algorithm will fail with some numbers larger than 2^52, but it isn't needed for numbers that big.

Here's one method that could work in most cases. I'm not sure if/how it will break if you give it NaN, INF, or very large (overflow) numbers... (Though I think they will all return false - not exactly representable.)
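For reference, the add-and-subtract 2^52 trick described at the top of the answers might look roughly like the sketch below. The function name is hypothetical, and the code assumes IEEE 754 doubles in the default round-to-nearest mode. It only tests for a whole number; it does not check that the value fits in a long.

```c
#include <math.h>
#include <stdbool.h>

/* Hypothetical helper: true if x is finite and has no fractional part.
 * Assumes IEEE 754 binary64 and the default round-to-nearest mode. */
static bool is_whole_double(double x)
{
    const double two52 = 4503599627370496.0;   /* 2^52 */
    double a = fabs(x);

    /* Written as !(a < two52) so that NaN also takes this branch. */
    if (!(a < two52))
        return isfinite(x);   /* finite doubles this large are always whole */

    /* volatile forces each intermediate result to be stored as a double,
     * so extended-precision registers cannot hide the rounding step. */
    volatile double t = a + two52;   /* rounds a to a whole number */
    t -= two52;                      /* exact subtraction */
    return t == a;
}
```

With this sketch, is_whole_double(3.0) is true and is_whole_double(3.5) is false.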
You could:
Something like this:
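A minimal sketch of a round-trip cast check follows. The function name and the explicit range guard are assumptions added here (not necessarily what this answer originally showed), and the code assumes long long is a 64-bit two's complement type.

```c
#include <stdbool.h>

/* Hypothetical helper; assumes long long is 64-bit two's complement. */
static bool roundtrips_as_long_long(double x)
{
    /* Guard first: casting NaN or an out-of-range value is undefined
     * behavior. -2^63 and 2^63 are both exactly representable as doubles,
     * and the negated form also rejects NaN. */
    if (!(x >= -9223372036854775808.0 && x < 9223372036854775808.0))
        return false;

    long long n = (long long)x;   /* cast to integer (truncates toward zero) */
    return (double)n == x;        /* cast back and compare with the original */
}
```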
floor() and ceil() are also fair game (though they may fail if the value overflows an integer):

And here's a messy bit-mask solution:
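One way such a bit-mask test could be written is sketched below. It is an illustration under stated assumptions (IEEE 754 binary64 layout, a 64-bit two's complement target, a hypothetical function name), not a reconstruction of the exact code this answer posted.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if x is exactly representable as a signed
 * 64-bit integer. Assumes IEEE 754 binary64 layout for double. */
static bool is_exact_int64_bits(double x)
{
    union { double d; uint64_t u; } pun = { .d = x };   /* type-punning */

    uint64_t bits     = pun.u;
    bool     negative = (bits >> 63) != 0;
    int      biased   = (int)((bits >> 52) & 0x7FF);      /* 11 exponent bits */
    uint64_t frac     = bits & 0x000FFFFFFFFFFFFFULL;     /* 52 fraction bits */

    if (biased == 0)
        return frac == 0;          /* +/-0 is whole; subnormals are not */
    if (biased == 0x7FF)
        return false;              /* infinity or NaN */

    int e = biased - 1023;         /* unbiased exponent */
    if (e < 0)
        return false;              /* 0 < |x| < 1 */
    if (e >= 63)                   /* |x| >= 2^63: only -2^63 itself fits */
        return negative && e == 63 && frac == 0;
    if (e >= 52)
        return true;               /* no bits left below the binary point */

    /* The low (52 - e) fraction bits sit below the binary point. */
    uint64_t below_point = (1ULL << (52 - e)) - 1;
    return (frac & below_point) == 0;
}
```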
This uses union type-punning and assumes IEEE double-precision. Union type-punning is only valid in C99 TR2 and later.
Could you use the modulus operator to check if the double is divisible by one... or am I completely misunderstanding the question?
Range (LONG_MIN, LONG_MAX) and fraction (frexp()) tests are needed. Also need to watch out for not-a-numbers.

The usual idea is to test like (double)(long)x == x, but to avoid its direct usage. (long)x, when x is out of range, is undefined behavior (UB).

The valid range of conversion for (long)x is LONG_MIN - 1 < x < LONG_MAX + 1, as code discards any fractional part of x during the conversion. So code needs to test, using FP math, if x is in range.

I think I have found a way to clamp a double into an integer in a standard-conforming fashion (this is not really what the question is about, but it helps a lot). First, we need to see why the obvious code is not correct.

The problem here is that in the second comparison, UINT64_MAX is being implicitly converted to double. The C standard does not specify exactly how this conversion works, only that it is to be rounded up or down to a representable value. This means that the second comparison may be false, even if it should mathematically be true (which can happen when UINT64_MAX is rounded up, and x is mathematically between UINT64_MAX and (double)UINT64_MAX). As such, the conversion of double to uint64_t can result in undefined behavior in that edge case.

Surprisingly, the solution is very simple. Consider that while UINT64_MAX may not be exactly representable in a double, UINT64_MAX+1, being a power of two (and not too large), certainly is. So, if we first round the input to an integer, the comparison x > UINT64_MAX is equivalent to x >= UINT64_MAX+1, except for possible overflow in the addition. We can fix the overflow by using ldexp instead of adding one to UINT64_MAX. That being said, the following code should be correct.

Now, to get back to your question: is x exactly representable in a uint64_t? Only if it was neither rounded nor clamped.

The same algorithm can be used for integers of different size, and also for signed integers with a minor modification. The code that follows does some very basic tests of the uint32_t and uint64_t versions (only false positives can possibly be caught), but is also suitable for manual examination of the edge cases.

You can use the modf function to split a float into the integer and fraction parts. modf is in the standard C library.
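A minimal sketch of how the modf suggestion might be combined with the range concerns raised in the earlier answers is shown below. The function name and the out-parameter are hypothetical, and the bound is built with ldexp as a power of two so that it is exact. The code assumes long is two's complement with no padding bits.

```c
#include <limits.h>
#include <math.h>
#include <stdbool.h>

/* Hypothetical helper: if x is exactly representable as a long, store it
 * in *out and return true; otherwise return false and leave *out alone.
 * Assumes long is two's complement with no padding bits. */
static bool double_to_long_exact(double x, long *out)
{
    /* 2^(value bits) equals LONG_MAX + 1 and is exact as a double. */
    const int    value_bits = (int)(sizeof(long) * CHAR_BIT) - 1;
    const double limit      = ldexp(1.0, value_bits);
    double       ipart;

    if (!isfinite(x))
        return false;                 /* NaN or infinity */
    if (modf(x, &ipart) != 0.0)
        return false;                 /* non-zero fractional part */
    if (!(x >= -limit && x < limit))
        return false;                 /* outside [LONG_MIN, LONG_MAX] */

    *out = (long)x;                   /* in range, so the cast is defined */
    return true;
}
```

For example, double_to_long_exact(42.0, &n) returns true and sets n to 42, while double_to_long_exact(42.5, &n) returns false.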