While investigating floating-point exception status flags, I came across the curious case of a status flag FE_UNDERFLOW
set when not expected.
This is similar to When does underflow occur? yet goes into a corner case that may be a C specification issue or FP hardware defect.
// pseudo code
// s bias_expo implied "mantissa"
w = smallest_normal; // 0 000...001 (1) 000...000
x = w * 2; // 0 000...010 (1) 000...000
y = next_smaller(x); // 0 000...001 (1) 111...111
round_mode(FE_TONEAREST);
clear_status_flags();
z = y/2; // 0 000...001 (1) 000...000
FE_UNDERFLOW is set!?
I did not expect FE_UNDERFLOW
to be set as z
above is normal, not a sub-normal.
I expect FE_UNDERFLOW
when the result of an earlier floating-point operation was subnormal with a loss of precision. In this case there is a loss of precision.
I tried this with my float
and long double
and had the same result.
After much investigation, I noted that __STDC_IEC_559__
is not defined.
Questions
If
__STDC_IEC_559__
is defined, what is the correct state of underflow in this case?With lack of a defined
__STDC_IEC_559__
am I stuck with "Implementations that do not define__STDC_IEC_559__
are not required to conform to these specifications." C11 or is there some C specification that indicates this result is incorrect?Since this is certainly a result of my hardware (processor), your result may differ and that would be interesting to know.
Follows is some test code that demos this. At first I suspected it is because FLT_EVAL_METHOD = 2
on my machine, yet then I tried similar code with long double
and the same result.
// These 2 includes missing in original post, yet in my true test code
#include <float.h>
#include <math.h>
#include <fenv.h>
#include <stdio.h>
#include <stdint.h>
#define N (sizeof excepts/sizeof excepts[0])
void Report_IEC_FP_exception_status_flags(const char *s) {
printf("%s", s);
int excepts[] = { //
FE_DIVBYZERO, FE_INEXACT, FE_INVALID, FE_OVERFLOW, FE_UNDERFLOW, };
const char *excepts_str[N] = { //
"FE_DIVBYZERO", "FE_INEXACT", "FE_INVALID", "FE_OVERFLOW", "FE_UNDERFLOW", };
int excepts_val[N];
for (unsigned i = 0; i < N; i++) {
excepts_val[i] = fetestexcept(excepts[i]);
}
for (unsigned i = 0; i < N; i++) {
if (excepts_val[i]) printf(" %s", excepts_str[i]);
}
printf("\n");
fflush(stdout);
}
#undef N
void test2(float f, int round_mode, const char *name) {
union {
float f;
uint32_t u32;
} x = { .f = f};
printf("x:%+17a %08lX normal:%c round_mode:%d %s\n", //
f, (unsigned long) x.u32, isnormal(f) ? 'Y' : 'n', round_mode, name);
if (feclearexcept(FE_ALL_EXCEPT)) puts("Clear Fail");
Report_IEC_FP_exception_status_flags("Before:");
f /= 2;
Report_IEC_FP_exception_status_flags("After :");
printf("y:%+17a %08lX normal:%c\n\n",
f,(unsigned long) x.u32, isnormal(f) ? 'Y' : 'n');
}
Driver
// In same file as above
int main(void) {
#ifdef __STDC_IEC_559__
printf("__STDC_IEC_559__ = %d\n", __STDC_IEC_559__);
#else
printf("__STDC_IEC_559__ = not define\n");
#endif
float f = FLT_MIN;
printf("FLT_EVAL_METHOD = %d\n", FLT_EVAL_METHOD);
printf("FLT_MIN:%+17a\n", f);
f *= 2.0f;
test2(f, FE_TONEAREST, "FE_TONEAREST");
f = nextafterf(f, 0);
test2(f, FE_TONEAREST, "FE_TONEAREST"); // *** problem? ***
f = nextafterf(f, 0);
test2(f, FE_TONEAREST, "FE_TONEAREST");
}
Output
__STDC_IEC_559__ = not define
FLT_EVAL_METHOD = 2
FLT_MIN: +0x1p-126
x: +0x1p-125 01000000 normal:Y round_mode:0 FE_TONEAREST
Before:
After :
y: +0x1p-126 01000000 normal:Y
x: +0x1.fffffep-126 00FFFFFF normal:Y round_mode:0 FE_TONEAREST
Before:
After : FE_INEXACT FE_UNDERFLOW *** Why FE_UNDERFLOW? ***
y: +0x1p-126 00FFFFFF normal:Y *** Result is normal ***
x: +0x1.fffffcp-126 00FFFFFE normal:Y round_mode:0 FE_TONEAREST
Before:
After :
y: +0x1.fffffcp-127 00FFFFFE normal:n
Ref
IEEE_754
Implementation notes:
GNU C11 (GCC) version 6.4.0 (i686-pc-cygwin) compiled by GNU C version 6.4.0, GMP version 6.1.2, MPFR version 3.1.5-p10, MPC version 1.0.3, isl version 0.14 or 0.13
glibc 2.26 released.
Intel Xeon W3530, 64-bit OS (Windows 7)
[Minor Update] The illustrative print of the quotient as an 32-bit hex number should have used y.u32
. This does not change the function under test
// printf("y:%+17a %08lX normal:%c\n\n",
// f,(unsigned long) x.u32, isnormal(f) ? 'Y' : 'n');
union {
float f;
uint32_t u32;
} y = { .f = f};
printf("y:%+17a %08lX normal:%c\n\n",
f,(unsigned long) y.u32, isnormal(f) ? 'Y' : 'n');
// ^^^^^