The use case:
I have some large data arrays containing floating point constants. The file defining those arrays is generated, and the template can easily be adapted.
I would like to run some tests on how reduced precision influences the results, both in terms of quality and in terms of the compressibility of the binary.
Since I do not want to change any source code other than the generated file, I am looking for a way to reduce the precision of the constants.
I would like to limit the mantissa to a fixed number of bits (setting the lower ones to 0). But since floating point literals are written in decimal, it is difficult to specify the numbers in a way that guarantees the binary representation contains all zeros in the lower mantissa bits.
The best case would be something like:
#define FP_REDUCE(x) /* some macro */
static const float32_t veryLargeArray[] = {
    FP_REDUCE(23.423f), FP_REDUCE(0.000023f), FP_REDUCE(290.2342f),
    // ...
};
#undef FP_REDUCE
This should be done at compile time and it should be platform independent.
What you're asking for can be done with varying degrees of partial portability, but not absolutely, unless you want to run the source file through your own preprocessing tool at build time to reduce the precision. If that's an option for you, it's probably your best one.
Short of that, I'm going to assume at least that your floating point types are base 2 and obey Annex F/IEEE semantics. This should be a reasonable assumption, but the latter is false with gcc on platforms (including 32-bit x86) with extended precision under the default standards-conformance profile; you need -std=cNN or -fexcess-precision=standard to fix it.
One approach is to add and subtract a power of two chosen to cause rounding to the desired precision:
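A minimal sketch of that idea (my own illustration, reusing the question's FP_REDUCE name; choosing p is discussed below):

typedef float float32_t;   /* as in the question; normally from a platform header */

/* Add-and-subtract sketch: p is a power of two chosen per x so that the
 * addition forces rounding at the desired bit.  The casts force rounding to
 * float32_t even if the implementation evaluates with excess precision.      */
#define FP_REDUCE(x, p)  ((float32_t) ((float32_t) ((x) + (p)) - (p)))

/* Example: for 23.423f (leading place 2^4), p = 0x1p16f keeps roughly 12 bits. */
static const float32_t example = FP_REDUCE(23.423f, 0x1p16f);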
Unfortunately, this works in absolute precisions, not relative, and requires knowing the right value p for the particular x, which is going to be equal to the value of the leading base-2 place of x, times 2 raised to the power of FLT_MANT_DIG minus the bits of precision you want. This cannot be evaluated as a constant expression for use as an initializer, but you can write it in terms of FLT_EPSILON and, if you can assume C99+, preprocessor token-pasting to form a hex float literal, yielding the correct value for this factor. But you still need to know the power of two for the leading digit of x; I don't see any way to extract that as a constant expression.
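As an illustration of that factor (my own sketch, not code from the answer):

#include <float.h>

/* FLT_EPSILON == 2^(1 - FLT_MANT_DIG), so 2^(FLT_MANT_DIG - keep) can be
 * written as a constant expression without knowing FLT_MANT_DIG's value;
 * HEXP pastes its argument (a plain digit sequence) into a hex float literal. */
#define HEXP(b)       0x1p##b##f                         /* 2^b as a float literal */
#define FACTOR(keep)  (2.0f / FLT_EPSILON / HEXP(keep))  /* 2^(FLT_MANT_DIG - keep) */

/* p for a given x is FACTOR(keep) times the leading base-2 place of x,
 * which is the part with no obvious constant-expression form.                */
static const float p_for_23f = FACTOR(12) * 0x1p4f;      /* for x around 23.4f */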
Edit: I believe this is fixable, so as not to need an absolute precision but rather to scale automatically to the value, but it depends on the correctness of a work in progress. See Is there a correct constant-expression, in terms of a float, for its msb?. If that works, I will later integrate the result with this answer.
Another approach I like, if your compiler supports compound literals in static initializers and if you can assume IEEE type representations, is using a union and masking off bits:
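Here is a sketch of that approach (my reconstruction, not the answer's exact code), assuming float32_t is the IEEE binary32 format:

#include <stdint.h>

typedef float float32_t;   /* as in the question; normally from a platform header */

/* Clear the low n mantissa bits of x by type-punning through a union.
 * Assumes float32_t is IEEE binary32 and uint32_t has no padding bits.
 * Truncates toward zero.                                                 */
#define FP_REDUCE(x, n)                                                   \
    ((union { uint32_t u; float32_t f; }){                                \
        .u = (union { float32_t f; uint32_t u; }){ .f = (x) }.u           \
             & ~(((uint32_t) 1 << (n)) - 1)                               \
    }.f)

/* Using this in a static initializer relies on the compiler extension
 * the answer mentions (compound literals in static initializers).       */
static const float32_t reduced = FP_REDUCE(23.423f, 8);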
where n is the number of bits you want to drop. This will round towards zero rather than to nearest; if you want it to round to nearest, it should be possible by adding an appropriate constant to the low bits before masking, but you have to take care about what happens when the addition overflows into the exponent bits.

The following uses the Veltkamp-Dekker splitting algorithm to remove n bits (with rounding) from x, where p = 2^n (for example, to remove eight bits, use 0x1p8f for the second argument):
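What follows is my reconstruction of that expression (the answer's exact code is not shown here), reusing the question's FP_REDUCE name:

typedef float float32_t;   /* as in the question; normally from a platform header */

/* Veltkamp-Dekker style splitting: x*(p+1), rounded, minus the exact x*p
 * leaves x with its low n bits rounded away (p = 2^n).  The float32_t casts
 * are the ones discussed below.                                              */
#define FP_REDUCE(x, p)                                                       \
    ((float32_t) ((float32_t) ((x) * ((p) + 1)) - (float32_t) ((x) * (p))))

/* Example: remove the low 8 of the 24 mantissa bits, rounding to nearest. */
static const float32_t reduced = FP_REDUCE(23.423f, 0x1p8f);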
The casts to float32_t coerce the results to that type, as the C standard otherwise permits implementations to use more precision within expressions. (Double-rounding could produce incorrect results in theory, but this will not occur when float32_t is the IEEE basic 32-bit binary format and the C implementation computes this expression in that format or in the 64-bit format or wider, as the former is the desired format and the latter is wide enough to represent intermediate results exactly.)

IEEE-754 binary floating-point is assumed, with round-to-nearest. Overflow occurs if x•(p+1) rounds to infinity.