I would like to get a broad view of what "denormal data" is, because the only things I think I have right are that, from a programmer's viewpoint, it is something especially related to floating-point values, and that, from the CPU's standpoint, it relates to a general-computing concern.
Can someone decode these two words for me?
EDIT
Please remember that I'm oriented to C++ applications and only the C++ language.
From the IEEE Documentation
You ask about C++, but the specifics of floating-point values and encodings are determined by a floating-point specification, notably IEEE 754, and not by C++. IEEE 754 is by far the most widely used floating-point specification, and I will answer using it.
In IEEE 754, binary floating-point values are encoded with three parts: a sign bit s (0 for positive, 1 for negative), a biased exponent e (the represented exponent plus a fixed offset), and a significand field f (the fraction portion). For normal numbers, these represent exactly the number (-1)^s • 2^(e-bias) • 1.f, where 1.f is the binary numeral formed by writing the significand bits after "1.". (For example, if the significand field has the ten bits 0010111011, it represents the significand 1.0010111011_2, which is 1.1826171875, or 1211/1024.)
The bias depends on the floating-point format. For 64-bit IEEE 754 binary, the exponent field has 11 bits, and the bias is 1023. When the actual exponent is 0, the encoded exponent field is 1023. Actual exponents of -2, -1, 0, 1, and 2 have encoded exponents of 1021, 1022, 1023, 1024, and 1025. When somebody speaks of the exponent of a subnormal number being zero they mean the encoded exponent is zero. The actual exponent would be less than -1022. For 64-bit, the normal exponent interval is -1022 to 1023 (encoded values 1 to 2046). When the exponent moves outside this interval, special things happen.
Above this interval, floating-point stops representing finite numbers. An encoded exponent of 2047 (all 1 bits) represents infinity (with the significand field set to zero). Below this range, floating-point changes to subnormal numbers. When the encoded exponent is zero, the significand field represents 0.f instead of 1.f.
There is an important reason for this. If the lowest exponent value were just another normal encoding, then the lower bits of its significand would be too small to represent as floating-point values by themselves. Without that leading "1.", there would be no way to say where the first 1 bit was. For example, suppose you had two numbers, both with the lowest exponent, and with significands 1.0010111011_2 and 1.0000000000_2. When you subtract the significands, the result is 0.0010111011_2. Unfortunately, there is no way to represent this as a normal number: because you were already at the lowest exponent, you cannot represent the lower exponent that is needed to say where the first 1 is in this result. Since the mathematical result is too small to be represented, a computer would be forced to return the nearest representable number, which would be zero.
This creates the undesirable property in the floating-point system that you can have `a != b` but `a - b == 0`. To avoid that, subnormal numbers are used. By using subnormal numbers, we have a special interval where the actual exponent does not decrease, and we can perform arithmetic without creating numbers too small to represent. When the encoded exponent is zero, the actual exponent is the same as when the encoded exponent is one, but the value of the significand changes to 0.f instead of 1.f. When we do this, `a != b` guarantees that the computed value of `a - b` is not zero.

Here are the combinations of values in the encodings of 64-bit IEEE 754 binary floating-point:

| Sign | Encoded exponent e | Significand field f | Value represented |
|------|--------------------|---------------------|-------------------|
| s | 1 to 2046 | anything | (-1)^s • 2^(e-1023) • 1.f (normal) |
| s | 0 | not 0 | (-1)^s • 2^-1022 • 0.f (subnormal) |
| s | 0 | 0 | +0 or -0 |
| s | 2047 | 0 | +infinity or -infinity |
| s | 2047 | not 0 | NaN |
Some notes:
+0 and -0 are mathematically equal, but the sign is preserved. Carefully written applications can make use of it in certain special situations.
NaN means "Not a Number". Commonly, it means some non-mathematical result or other error has occurred, and a calculation should be discarded or redone another way. Generally, an operation with a NaN produces another NaN, thus preserving the information that something has gone wrong. For example, `3 + NaN` produces a NaN. A signaling NaN is intended to cause an exception, either to indicate that a program has gone wrong or to allow other software (e.g., a debugger) to perform some special action. A quiet NaN is intended to propagate through to further results, allowing the rest of a large computation to be completed, in cases where a NaN is only part of a large set of data and will be handled separately later or discarded.

The signs, + and -, are retained with NaNs but have no mathematical value.
In normal programming, you should not be concerned about the floating-point encoding, except to the extent it informs you about the limits and behavior of floating-point calculations. You should not need to do anything special regarding subnormal numbers.
Unfortunately, some processors are broken in that they either violate the IEEE 754 standard by changing subnormal numbers to zero or they perform very slowly when subnormal numbers are used. When programming for such processors, you may seek to avoid using subnormal numbers.
To understand de-normal floating point values you first have to understand normal ones. A floating point value has a mantissa and an exponent. In a decimal value, like 1.2345E6, 1.2345 is the mantissa, 6 is the exponent. A nice thing about floating point notation is that you can always write it normalized. 0.012345E8 and 0.12345E7 are the same value as 1.2345E6. Or in other words, you can always make the first digit of the mantissa a non-zero number, as long as the value is not zero.
Computers store floating point values in binary, the digits are 0 or 1. So a property of a binary floating point value that is not zero is that it can always be written starting with a 1.
This is a very attractive optimization target. Since the value always starts with 1, there is no point in storing that 1. What is nice about it is that you in effect get an extra bit of precision for free. On a 64-bit double, the mantissa has 52 bits of storage. The actual precision is 53 bits thanks to the implied 1.
We have to talk about the smallest possible floating point value that you can store this way. Doing it in decimal first, if you had a decimal processor with 5 digits of storage in the mantissa and 2 in the exponent then the smallest value it could store that isn't zero is 1.00000E-99. With 1 being the implied digit that isn't stored (doesn't work in decimal but bear with me). So the mantissa stores 00000 and the exponent stores -99. You cannot store a smaller number, the exponent is maxed-out at -99.
Well, you can. You could give up on the normalized representation and forget about the implied digit optimization. You can store it de-normalized. Now you can store 0.1000E-99, or 1.000E-100. All the way down to 0.0001E-99 or 1E-103, the absolute smallest number you can now store.
This is in general desirable, it extends the range of values you can store. Which tends to matter in practical computations, very small numbers are very common in real-world problems like differential analysis.
There's however also a big problem with it, you lose accuracy with de-normalized numbers. The accuracy of floating point calculations is limited by the number of digits you can store. It is intuitive with the fake decimal processor I used as an example, it can only ever compute with 5 significant digits. As long as the value is normalized, you always get 5 significant digits.
But you'll lose digits when you de-normalize. Any value between 0.1000E-99 and 0.9999E-99 has only 4 significant digits. Any value between 0.0100E-99 and 0.0999E-99 has only 3 significant digits. All the way down to 0.0001E-99 and 0.0009E-99, only one significant digit left.
This can greatly reduce the accuracy of the final calculation result. What's worse, it does so in a highly unpredictable manner since these very small de-normalized values tend to show up in a more involved calculation. That's certainly something to worry about, you cannot really trust the end result anymore when it has only 1 significant digit left.
Floating point processors have ways to let you know about this or otherwise sail around the problem. They can for example generate an interrupt or signal when a value becomes de-normalized, letting you interrupt the calculation. And they have a "flush-to-zero" option, a bit in the status word that tells the processor to automatically convert all de-normal values to zero. That in turn tends to generate infinities (for example when you later divide by the flushed value), an outcome that tells you that the result is junk and should be discarded.
IEEE 754 basics
First let's review the basics of how IEEE 754 numbers are organized.
Let's focus on single precision (32-bit) first.
The format is:

- sign: 1 bit
- exponent: 8 bits
- fraction (significand): 23 bits
The sign is simple: 0 is positive, and 1 is negative, end of story.
The exponent is 8 bits long, and so it ranges from 0 to 255.
The exponent is called biased because it has an offset of `-127`: e.g., an encoded exponent of 127 represents an actual exponent of 0, 128 represents 1, and 126 represents -1.

The leading bit convention
While designing IEEE 754, engineers noticed that all numbers, except `0.0`, have a `1` in binary as the first significant digit. E.g., `25.0 = 1.1001_2 × 2^4` and `0.625 = 1.01_2 × 2^-1` both start with that annoying `1.` part. Therefore, it would be wasteful to let that digit take up one precision bit in almost every single number.
For this reason, they created the "leading bit convention": the leading `1` is not stored at all; the 23-bit fraction field holds only the digits after the `1.`, giving 24 bits of effective precision.
But then how to deal with `0.0`? Well, they decided to create an exception: when the exponent field is 0 and the fraction field is 0, the number represented is `0.0`, so that the bytes `00 00 00 00` also represent `0.0`, which looks good.

If we only considered these rules, then the smallest non-zero number that could be represented would have exponent field 0 and fraction field 1,
which, due to the leading bit convention, looks something like `1.000002 × 2^-127` as a hex fraction, where `.000002` is 22 zeroes with a `1` at the end. We cannot take `fraction = 0`, otherwise that number would be `0.0`.
But then the engineers, who also had a keen artistic sense, thought: isn't that ugly? That we jump straight from `0.0` to something that is not even a proper power of 2? Couldn't we represent even smaller numbers somehow?

Denormal numbers
The engineers scratched their heads for a while, and came back, as usual, with another good idea. What if we create a new rule: when the exponent field is 0, the leading bit is `0` instead of `1` (the significand is `0.f` rather than `1.f`), and the actual exponent is fixed at -126, as if the encoded exponent were 1 rather than 0.

This rule immediately implies that the number with exponent field 0 and fraction field 0 is `0.0`, which is kind of elegant, as it means one less rule to keep track of. So `0.0` is actually a subnormal number according to our definition!

With this new rule, the smallest non-subnormal number has exponent field 1 and fraction field 0, which represents `2^-126`.
Then, the largest subnormal number has exponent field 0 and fraction field 0x7FFFFF (all 23 bits set), which equals `0.FFFFFE × 2^-126`, where `.FFFFFE` is once again 23 one-bits to the right of the dot. This is pretty close to the smallest non-subnormal number, which sounds sane.
And the smallest non-zero subnormal number has exponent field 0 and fraction field 1, which equals `0.000002 × 2^-126 = 2^-149`, which also looks pretty close to `0.0`!

Unable to find any sensible way to represent numbers smaller than that, the engineers were happy, and went back to viewing cat pictures online, or whatever they did in the 70s instead.
As you can see, subnormal numbers do a trade-off between precision and representation length.
As the most extreme example, the smallest non-zero subnormal, `2^-149`, has essentially a precision of a single bit instead of the usual 24. For example, if we divide it by two, the exact result `2^-150` lies halfway between `0` and `2^-149`, and round-to-nearest-even means we actually reach `0.0` exactly!

Runnable C example
Now let's play with some actual code to verify our theory.
In almost all current desktop machines, the C `float` type represents single-precision IEEE 754 floating-point numbers. This is in particular the case for my Ubuntu 18.04 amd64 laptop.
With that assumption, all assertions pass on the following program (`subnormal.c`, GitHub upstream).
Visualization
It is always a good idea to have a geometric intuition about what we learn, so here goes.
If we plot the IEEE 754 floating-point numbers on a line for each given exponent, we can see that for each exponent:

- the representable numbers are equally spaced
- every exponent range contains the same number of points (`2^23`), so the spacing doubles each time the exponent increases by one

Now, let's bring that down all the way to exponent 0.
Without subnormals (hypothetical): the `2^23` points of the exponent-0 encoding cover only `[2^-127, 2^-126)`.

With subnormals: those same `2^23` points are spread uniformly over `[0, 2^-126)`.

By comparing the two layouts, we see that:

- subnormals double the length of the range of exponent `0`, from `[2^-127, 2^-126)` to `[0, 2^-126)`; the space between floats in the subnormal range becomes the same as for `[2^-126, 2^-125)`
- the range `[2^-127, 2^-126)` has half the number of points that it would have without subnormals; half of those points go to fill the other half of the range `[0, 2^-126)`
- the range `[0, 2^-127)` has some points with subnormals, but none without
- the range `[2^-128, 2^-127)` has half the points of `[2^-127, 2^-126)`

This is what we mean when we say that subnormals are a trade-off between size and precision.
Without subnormals, we would have an empty gap between `0` and `2^-127`, which is not very elegant. With subnormals, the interval is well populated, and contains `2^23` floats like any other exponent range.

Implementations
x86_64 implements IEEE 754 directly on hardware, which the C code translates to.
TODO: any notable examples of modern hardware that don't have subnormals?
TODO: does any implementation allow controlling it at runtime?
Subnormals seem to be less fast than normals in certain implementations: Why does changing 0.1f to 0 slow down performance by 10x?
Infinity and NaN
Here is a short runnable example: Ranges of floating point datatype in C?