Assuming IEEE-754 conformance, is a float guaranteed to be preserved when transported through a double?
In other words, will the following assert always be satisfied?
#include <assert.h>

float some_random_float(void);  /* returns an arbitrary float value */

int main(void)
{
    float f = some_random_float();
    assert(f == (float)(double)f);
}
Assume that f could acquire any of the special values defined by IEEE, such as NaN and Infinity.
According to IEEE, is there a case where the assert will be satisfied, but the exact bit-level representation is not preserved after the transportation through double?
The code snippet is valid in both C and C++.
From C99 (6.3.1.5, conversions of real floating types): promoting a float to double leaves its value unchanged, and demoting a double whose value is exactly representable as a float yields that value. I think this guarantees that a float -> double -> float conversion is going to preserve the original float value.
The standard also defines the macros INFINITY and NAN in 7.12 Mathematics <math.h>. So there is provision for such special values, and the conversions may just work for them as well (including for minus infinity and negative zero).
You don't even need to assume IEEE. C89 says in 3.1.2.5 that the set of values of the type float is a subset of the set of values of the type double.
And every other C and C++ standard says equivalent things. As far as I know, NaNs and infinities are "values of the type float", albeit values with some special-case rules when used as operands.

The fact that the float -> double -> float conversion restores the original value of the float follows (in general) from the fact that numeric conversions all preserve the value if it's representable in the destination type.

Bit-level representations are a slightly different matter. Imagine a value of float that has two distinct bitwise representations. Then nothing in the C standard prevents the float -> double -> float conversion from switching one to the other. In IEEE that won't happen for "actual values" unless there are padding bits, but I don't know whether IEEE rules out a single NaN having distinct bitwise representations. NaNs don't compare equal to themselves anyway, so there's no standard way to tell whether two NaNs are "the same NaN" or "different NaNs", other than maybe converting them to strings. The issue may be moot.

One thing to watch out for is non-conforming compiler modes that keep extra-precise values "under the covers", for example intermediate results left in floating-point registers and reused without rounding. I don't think that would cause your example code to fail, but as soon as you compare floating-point values with ==, it's the kind of thing you start worrying about.

The assertion will fail in flush-to-zero and/or denormals-are-zero mode (e.g. code compiled with -mfpmath=sse, -ffast-math, etc., but also by default on plenty of compilers and architectures, such as Intel's C++ compiler) if f is denormalized.
In that mode you cannot produce a denormalized float yourself, but the scenario is still possible:
a) The denormalized float comes from an external source.
b) Some library tampers with the FPU mode but forgets (or intentionally avoids) restoring it after each call, making it possible for the caller's normalization settings to mismatch.
A practical example which prints the following:
The example works on both VC2010 and GCC 4.3, but it assumes that VC uses SSE for math by default and that GCC uses the x87 FPU by default. Otherwise the example may fail to illustrate the problem.