The use case:
I have some large data arrays containing floating point constants. The file defining those arrays is generated, and the template can easily be adapted.
I would like to run some tests on how reduced precision influences the results, both in terms of quality and in terms of the compressibility of the binary.
Since I do not want to change any source code other than the generated file, I am looking for a way to reduce the precision of the constants.
I would like to limit the mantissa to a fixed number of bits (setting the lower ones to 0). But since floating point literals are written in decimal, it is difficult to specify the numbers in a way that guarantees the binary representation contains all zeros in the lower mantissa bits.
The best case would be something like:
#define FP_REDUCE(x) /* some macro */
static const float32_t veryLargeArray[] = {
    FP_REDUCE(23.423f), FP_REDUCE(0.000023f), FP_REDUCE(290.2342f),
    // ...
};
#undef FP_REDUCE
This should be done at compile time and it should be platform independent.
What you're asking for can be done with varying degrees of partial portability, but not absolutely, unless you want to run the source file through your own preprocessing tool at build time to reduce the precision. If that's an option for you, it's probably your best one.
Short of that, I'm going to assume at least that your floating point types are base 2 and obey Annex F/IEEE semantics. This should be a reasonable assumption, but the latter is false with gcc on platforms (including 32-bit x86) with extended precision under the default standards-conformance profile; you need -std=cNN or -fexcess-precision=standard to fix it.
One approach is to add and subtract a power of two chosen to cause rounding to the desired precision:
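A minimal sketch of that idea (my own illustration, reusing the question's FP_REDUCE name; choosing p is discussed below):

typedef float float32_t;   /* as in the question; normally from a platform header */

/* Add-and-subtract sketch: p is a power of two chosen per x so that the
 * addition forces rounding at the desired bit.  The casts force rounding to
 * float32_t even if the implementation evaluates with excess precision.      */
#define FP_REDUCE(x, p)  ((float32_t) ((float32_t) ((x) + (p)) - (p)))

/* Example: for 23.423f (leading place 2^4), p = 0x1p16f keeps roughly 12 bits. */
static const float32_t example = FP_REDUCE(23.423f, 0x1p16f);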
Unfortunately, this works in absolute precisions, not relative, and requires knowing the right value p for the particular x, which is going to be equal to the value of the leading base-2 place of x, times 2 raised to the power of FLT_MANT_DIG minus the bits of precision you want. This cannot be evaluated as a constant expression for use as an initializer, but you can write it in terms of FLT_EPSILON and, if you can assume C99+, preprocessor token-pasting to form a hex float literal, yielding the correct value for this factor. But you still need to know the power of two for the leading digit of x; I don't see any way to extract that as a constant expression.
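As an illustration of that factor (my own sketch, not code from the answer):

#include <float.h>

/* FLT_EPSILON == 2^(1 - FLT_MANT_DIG), so 2^(FLT_MANT_DIG - keep) can be
 * written as a constant expression without knowing FLT_MANT_DIG's value;
 * HEXP pastes its argument (a plain digit sequence) into a hex float literal. */
#define HEXP(b)       0x1p##b##f                         /* 2^b as a float literal */
#define FACTOR(keep)  (2.0f / FLT_EPSILON / HEXP(keep))  /* 2^(FLT_MANT_DIG - keep) */

/* p for a given x is FACTOR(keep) times the leading base-2 place of x,
 * which is the part with no obvious constant-expression form.                */
static const float p_for_23f = FACTOR(12) * 0x1p4f;      /* for x around 23.4f */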
Edit: I believe this is fixable, so as not to need an absolute precision but rather to scale automatically to the value, but it depends on the correctness of a work in progress. See Is there a correct constant-expression, in terms of a float, for its msb?. If that works, I will later integrate the result with this answer.
Another approach I like, if your compiler supports compound literals in static initializers and if you can assume IEEE type representations, is using a union and masking off bits:
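Here is a sketch of that approach (my reconstruction, not the answer's exact code), assuming float32_t is the IEEE binary32 format:

#include <stdint.h>

typedef float float32_t;   /* as in the question; normally from a platform header */

/* Clear the low n mantissa bits of x by type-punning through a union.
 * Assumes float32_t is IEEE binary32 and uint32_t has no padding bits.
 * Truncates toward zero.                                                 */
#define FP_REDUCE(x, n)                                                   \
    ((union { uint32_t u; float32_t f; }){                                \
        .u = (union { float32_t f; uint32_t u; }){ .f = (x) }.u           \
             & ~(((uint32_t) 1 << (n)) - 1)                               \
    }.f)

/* Using this in a static initializer relies on the compiler extension
 * the answer mentions (compound literals in static initializers).       */
static const float32_t reduced = FP_REDUCE(23.423f, 8);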
where n is the number of bits you want to drop. This will round towards zero rather than to nearest; if you want it to round to nearest, it should be possible by adding an appropriate constant to the low bits before masking, but you have to take care about what happens when the addition overflows into the exponent bits.

The following uses the Veltkamp-Dekker splitting algorithm to remove n bits (with rounding) from x, where p = 2^n (for example, to remove eight bits, use 0x1p8f for the second argument):
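What follows is my reconstruction of that expression (the answer's exact code is not shown here), reusing the question's FP_REDUCE name:

typedef float float32_t;   /* as in the question; normally from a platform header */

/* Veltkamp-Dekker style splitting: x*(p+1), rounded, minus the exact x*p
 * leaves x with its low n bits rounded away (p = 2^n).  The float32_t casts
 * are the ones discussed below.                                              */
#define FP_REDUCE(x, p)                                                       \
    ((float32_t) ((float32_t) ((x) * ((p) + 1)) - (float32_t) ((x) * (p))))

/* Example: remove the low 8 of the 24 mantissa bits, rounding to nearest. */
static const float32_t reduced = FP_REDUCE(23.423f, 0x1p8f);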
The casts to float32_t coerce the results to that type, as the C standard otherwise permits implementations to use more precision within expressions. (Double-rounding could produce incorrect results in theory, but this will not occur when float32_t is the IEEE basic 32-bit binary format and the C implementation computes this expression in that format or in the 64-bit format or wider, as the former is the desired format and the latter is wide enough to represent intermediate results exactly.)

IEEE-754 binary floating-point is assumed, with round-to-nearest. Overflow occurs if x•(p+1) rounds to infinity.