Rules-of-thumb for minimising floating-point error


Question:

Regarding minimising the error in floating-point operations, if I have an operation such as the following in C:

float a = 123.456f;
float b = 456.789f;
float r = 0.12345f;
a = a - (r * b);

Will the result of the calculation change if I split the multiplication and subtraction steps out, i.e.:

float c = r * b;
a = a - c;

I am wondering whether a CPU would treat these calculations differently, and whether the error may therefore be smaller in one case.

If not, which I presume anyway, are there any good rules of thumb to mitigate floating-point error? Can I massage the data in a way that will help?

Please don't just say "use higher precision" - that's not what I'm after.

EDIT

Regarding the data: in general, errors seem to be worse when the operation results in a very large number like 123456789. Small numbers, such as 1.23456789, seem to yield more accurate results after operations. Am I imagining this, or would scaling the larger numbers help accuracy?

Answer 1:

Note: this answer starts with a lengthy discussion of the distinction between a = a - (r * b); and float c = r * b; a = a - c; with a C99-compliant compiler. The part of the question about the goal of improving accuracy while avoiding extended precision is covered at the end.

Extended floating-point precision for intermediate results

If your C99 compiler defines FLT_EVAL_METHOD as 0, then the two computations can be expected to produce exactly the same result. If the compiler defines FLT_EVAL_METHOD as 1 or 2, then a = a - (r * b); will be more precise for some values of a, r and b, because all intermediate computations will be done at an extended precision (double for the value 1 and long double for the value 2).

The program cannot set FLT_EVAL_METHOD itself, but you can use command-line options to change the way your compiler evaluates floating-point expressions, and the macro's definition will change accordingly.
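You can check the evaluation method your compiler uses at build time. A minimal sketch, using only standard C99 headers:

    #include <float.h>
    #include <stdio.h>

    int main(void)
    {
        /* FLT_EVAL_METHOD, from <float.h>:
            0: each operation is evaluated in the precision of its type,
            1: float and double operations are evaluated in double,
            2: all operations are evaluated in long double,
           -1: indeterminable. */
        printf("FLT_EVAL_METHOD = %d\n", (int)FLT_EVAL_METHOD);
        return 0;
    }

Compiling with different options (for instance -mfpmath=387 versus -mfpmath=sse with GCC on x86) changes the printed value along with the evaluation method.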

Contraction of some intermediate results

Depending on whether you use #pragma STDC FP_CONTRACT in your program and on your compiler's default value for this pragma, some compound floating-point expressions can be contracted into single instructions that behave as if the intermediate result were computed with infinite precision. This happens to be a possibility for your example when targeting a modern processor, as the fused multiply-add (FMA) instruction will compute a directly and as accurately as allowed by the floating-point type.

However, you should bear in mind that the contraction only takes place at the compiler's option, without any guarantees. The compiler uses the FMA instruction to optimize speed, not accuracy, so the transformation may not take place at lower optimization levels. Sometimes several transformations are possible (e.g. a * b + c * d can be computed either as fmaf(c, d, a*b) or as fmaf(a, b, c*d)), and the compiler may choose one or the other.

In short, the contraction of floating-point computations is not intended to help you achieve accuracy. You may even want to make sure it is disabled if you value reproducible results.
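Disabling it in the source looks like this (a sketch; the helper name update is made up for the example, and some compilers ignore the pragma, in which case a command-line option such as GCC's -ffp-contract=off is needed instead):

    /* Disable contraction in this translation unit: the compiler may
       no longer fuse r * b and the subtraction into one FMA. */
    #pragma STDC FP_CONTRACT OFF

    float update(float a, float r, float b)
    {
        return a - r * b; /* product and subtraction rounded separately */
    }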

However, in the particular case of the fused multiply-add compound operation, you can use the C99 standard function fmaf() to tell the compiler to compute the multiplication and addition in a single step with a single rounding. If you do this, the compiler will not be allowed to produce anything other than the best result for a.


     float fmaf(float x, float y, float z);

DESCRIPTION
     The fma() functions compute (x*y)+z, rounded as one ternary operation:
     they compute the value (as if) to infinite precision and round once to
     the result format, according to the current rounding mode.
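Applied to the code from the question, a = a - (r * b); becomes a single correctly rounded operation. Negating r is exact in IEEE 754 arithmetic, so fmaf(-r, b, a) is exactly a - r*b rounded once (on some platforms you need to link with -lm):

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        float a = 123.456f;
        float b = 456.789f;
        float r = 0.12345f;

        /* a - r*b computed with one rounding instead of two */
        a = fmaf(-r, b, a);

        printf("%.9g\n", a);
        return 0;
    }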

Note that if the FMA instruction is not available, your compiler's implementation of the function fmaf() will at best just use higher precision. If this happens on your compilation platform, you might just as well use the type double for the accumulator: it will be faster and more accurate than fmaf(). In the worst case, a flawed implementation of fmaf() will be provided.
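For comparison, the double-accumulator alternative could look like this (a sketch; the helper name update is again made up for the example):

    float update(float a, float r, float b)
    {
        /* A 24-bit x 24-bit product fits exactly in double's 53-bit
           significand, so only the subtraction (in double) and the
           final conversion back to float introduce rounding. */
        double acc = (double)a - (double)r * (double)b;
        return (float)acc;
    }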

Improving accuracy while only using single-precision

Use Kahan summation if your computation involves a long chain of additions. Some accuracy can be gained by simply summing the r*b terms computed as single-precision products, assuming there are many of them. If you wish to gain more accuracy, you might want to compute r*b itself exactly as the sum of two single-precision numbers, but if you do this you might as well switch to double-single arithmetic entirely. Double-single arithmetic is the same as the double-double technique succinctly described here, but with single-precision numbers instead.
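As an illustration of the first suggestion, a compensated (Kahan) sum of the r*b products entirely in single precision might look like this (a sketch with made-up names; beware that aggressive flags such as -ffast-math may optimize the compensation away):

    #include <stddef.h>

    float kahan_dot(const float *r, const float *b, size_t n)
    {
        float sum = 0.0f;
        float c   = 0.0f;                 /* running compensation */
        for (size_t i = 0; i < n; i++) {
            float term = r[i] * b[i];     /* single-precision product */
            float y = term - c;           /* re-inject the lost low bits */
            float t = sum + y;            /* low bits of y are lost here... */
            c = (t - sum) - y;            /* ...and recovered here */
            sum = t;
        }
        return sum;
    }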