Assuming IEEE-754 conformance, is a float guaranteed to be preserved when transported through a double?
In other words, will the following assert always be satisfied?
int main()
{
    float f = some_random_float();
    assert(f == (float)(double)f);
}
Assume that f could acquire any of the special values defined by IEEE, such as NaN and Infinity.
According to IEEE, is there a case where the assert will be satisfied, but the exact bit-level representation is not preserved after the transportation through double?
The code snippet is valid in both C and C++.
You don't even need to assume IEEE. C89 says in 3.1.2.5:
The set of values of the type float is a subset of the set of values of the type double
And every other C and C++ standard says equivalent things. As far as I know, NaNs and infinities are "values of the type float", albeit values with some special-case rules when used as operands.
The fact that the float -> double -> float conversion restores the original value of the float follows (in general) from the fact that numeric conversions all preserve the value if it's representable in the destination type.
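As a rough illustration of that argument, the sketch below walks a strided sample of all 32-bit patterns and counts the non-NaN floats that fail the round trip. The helper names (from_bits, round_trips, count_failures) are made up for this example, and it assumes a 32-bit float and the default floating-point mode. NaNs are skipped because a NaN never compares equal to itself, so the assert in the question fails for a NaN input even if nothing was lost in the conversion.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "sketch assumes a 32-bit float");

// Hypothetical helper: reinterpret a 32-bit pattern as a float.
inline float from_bits(std::uint32_t bits)
{
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// Hypothetical helper: the comparison from the question's assert.
// The volatile double forces a real conversion instead of a folded-away cast.
inline bool round_trips(float f)
{
    volatile double d = f;
    float back = static_cast<float>(d);
    return f == back;
}

// Count non-NaN patterns (sampled every `stride` patterns) that fail the round trip.
inline std::uint64_t count_failures(std::uint32_t stride)
{
    std::uint64_t failures = 0;
    for (std::uint64_t b = 0; b <= 0xFFFFFFFFull; b += stride) {
        float f = from_bits(static_cast<std::uint32_t>(b));
        if (!std::isnan(f) && !round_trips(f))
            ++failures;
    }
    return failures;
}
```

count_failures(1) would cover every pattern; a larger stride such as 65521 is a quicker spot check. The stride must be nonzero.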
Bit-level representations are a slightly different matter. Imagine that there's a value of float that has two distinct bitwise representations. Then nothing in the C standard prevents the float -> double -> float conversion from switching one to the other. In IEEE that won't happen for "actual values" unless there are padding bits, but I don't know whether IEEE rules out a single NaN having distinct bitwise representations. NaNs don't compare equal to themselves anyway, so there's also no standard way to tell whether two NaNs are "the same NaN" or "different NaNs" other than maybe converting them to strings. The issue may be moot.
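To look at representations rather than values, one option is to compare the raw bytes with memcpy. The sketch below does that. The names to_bits and bits_survive are invented for illustration, and a 32-bit float is assumed. It also shows why == alone is not enough for the bit-level question: +0.0f and -0.0f compare equal but have different patterns. No claim is made here about what happens to NaN patterns.

```cpp
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "sketch assumes a 32-bit float");

// Hypothetical helper: the raw 32-bit pattern of a float.
inline std::uint32_t to_bits(float f)
{
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return bits;
}

// Hypothetical helper: true if float -> double -> float returns
// the identical bit pattern, not merely an equal value.
inline bool bits_survive(float f)
{
    volatile double d = f;  // volatile forces a real conversion
    float back = static_cast<float>(d);
    return to_bits(f) == to_bits(back);
}
```

Calling bits_survive(-0.0f) checks that the sign of zero is kept, which the == in the question cannot detect.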
One thing to watch out for is non-conforming modes of compilers, in which they keep super-precise values "under the covers", for example intermediate results left in floating-point registers and reused without rounding. I don't think that would cause your example code to fail, but as soon as you're doing floating-point == it's the kind of thing you start worrying about.
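A related thing that can be checked from code is the FLT_EVAL_METHOD macro from <cfloat> (C99 and C++11), which reports whether float expressions are evaluated in a wider format. This is only a sketch, and the function name describe_eval_method is invented here. The macro describes the documented evaluation mode, so it will not necessarily reveal the non-conforming behaviour described above.

```cpp
#include <cfloat>
#include <string>

// Hypothetical helper: spell out the meaning of FLT_EVAL_METHOD.
inline std::string describe_eval_method()
{
    switch (FLT_EVAL_METHOD) {
    case 0:  return "operations evaluated in the range and precision of their own type";
    case 1:  return "float and double operations evaluated as double";
    case 2:  return "all operations evaluated as long double";
    case -1: return "indeterminable";
    default: return "implementation-defined";
    }
}
```

A value other than 0 is a hint that comparisons of intermediate float results with == deserve extra care.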
From C99:
6.3.1.5 Real floating types
1 When a float is promoted to double or long double, or a double is promoted to long double, its value is unchanged.
2 When a double is demoted to float, a long double is demoted to double or float, or a value being represented in greater precision and range than required by its semantic type (see 6.3.1.8) is explicitly converted to its semantic type, if the value being converted can be represented exactly in the new type, it is unchanged...
I think this guarantees that a float->double->float conversion is going to preserve the original float value.
The standard also defines the macros INFINITY and NAN in 7.12 Mathematics <math.h>:
4 The macro INFINITY expands to a constant expression of type float representing positive or unsigned infinity, if available; else to a positive constant of type float that overflows at translation time.
5 The macro NAN is defined if and only if the implementation supports quiet NaNs for the float type. It expands to a constant expression of type float representing a quiet NaN.
So there is provision for such special values, and the conversions may just work for them as well (including negative infinity and negative zero).
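A quick way to try this is to push the macros themselves through double and back. The sketch below assumes the implementation defines NAN (that is, it supports quiet NaNs); via_double and specials_survive are names made up for this example. The NaN is tested with isnan rather than ==, and negative zero with signbit, since == cannot tell the two zeros apart.

```cpp
#include <cmath>

// Hypothetical helper: pass a float through double and back.
inline float via_double(float f)
{
    volatile double d = f;  // volatile forces a real conversion
    return static_cast<float>(d);
}

// True if both infinities, negative zero and the quiet NaN
// from <cmath> all come back as the same kind of value.
inline bool specials_survive()
{
    const float neg_zero = via_double(-0.0f);
    return via_double(INFINITY) == INFINITY
        && via_double(-INFINITY) == -INFINITY
        && neg_zero == 0.0f && std::signbit(neg_zero)
        && std::isnan(via_double(NAN));
}
```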
The assertion will fail in flush-to-zero and/or denormals-are-zero mode (e.g. code compiled with -mfpmath=sse, -ffast-math, etc., but also by default on many compilers and architectures, such as Intel's C++ compiler) if f is denormalized.
You cannot produce a denormalized float in that mode, but the scenario is still possible:
a) The denormalized float comes from an external source.
b) Some libraries tamper with the FPU modes but forget (or intentionally avoid) restoring them before returning, so the caller can end up running in a different denormal-handling mode than the one in which its values were produced.
Here is a practical example, which prints the following:
f = 5.87747e-39
f2 = 5.87747e-39
f = 5.87747e-39
f2 = 0
error, f != f2!
The example works both for VC2010 and GCC 4.3 but assumes that VC uses SSE for math as default and GCC uses FPU for math as default. The example may fail to illustrate the problem otherwise.
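#include <limits>
#include <iostream>
#include <cmath>
#ifdef _MSC_VER
#include <xmmintrin.h>
#endif

// True if t is zero or at least as large in magnitude as the smallest
// normal value (so false for denormalized values).
template <class T> bool normal(T t)
{
    return (t == 0 || std::fabs(t) >= std::numeric_limits<T>::min());
}

// Set the flush-to-zero bit (bit 15) of the SSE control/status register.
void csr_flush_to_zero()
{
#ifdef _MSC_VER
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
#else
    unsigned csr = __builtin_ia32_stmxcsr();
    csr |= (1 << 15);
    __builtin_ia32_ldmxcsr(csr);
#endif
}

// Round-trip f through double and report whether the value changed.
void test_cast(float f)
{
    std::cout << "f = " << f << "\n";
    double d = double(f);
    float f2 = float(d);
    std::cout << "f2 = " << f2 << "\n";
    if (f != f2)
        std::cout << "error, f != f2!\n";
    std::cout << "\n";
}

int main()
{
    // Half the smallest normal float, i.e. a denormalized value.
    float f = std::numeric_limits<float>::min() / 2.0;
    test_cast(f);          // default mode: value survives
    csr_flush_to_zero();
    test_cast(f);          // flush-to-zero mode: f2 becomes 0
}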
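For point (b) above, one way a library could avoid leaking a changed mode to its caller is a small scoped guard that restores the previous setting. This is a sketch only: the class name FlushToZeroGuard is invented here, it relies on the <xmmintrin.h> macros, so it is limited to x86/x86-64 targets with SSE, and it affects SSE arithmetic only, not the x87 FPU.

```cpp
#include <xmmintrin.h>  // SSE control-register macros; x86/x86-64 only

// Hypothetical RAII guard: turns flush-to-zero on for the current scope
// and restores whatever setting was active before.
class FlushToZeroGuard
{
public:
    FlushToZeroGuard() : saved_(_MM_GET_FLUSH_ZERO_MODE())
    {
        _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    }
    ~FlushToZeroGuard() { _MM_SET_FLUSH_ZERO_MODE(saved_); }

    FlushToZeroGuard(const FlushToZeroGuard&) = delete;
    FlushToZeroGuard& operator=(const FlushToZeroGuard&) = delete;

private:
    unsigned int saved_;  // flush-to-zero bits as they were on entry
};
```

Declaring a FlushToZeroGuard at the top of a function keeps the mode change local to that call, so code like test_cast in the caller sees the mode it started with.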