Regarding minimising the error in floating-point operations, if I have an operation such as the following in C:
float a = 123.456;
float b = 456.789;
float r = 0.12345;
a = a - (r * b);
Will the result of the calculation change if I split the multiplication and subtraction steps out, i.e.:
float c = r * b;
a = a - c;
I am wondering whether the CPU would then treat these calculations differently, and whether the error might therefore be smaller in one case.
If not, which is what I presume anyway, are there any good rules of thumb to mitigate floating-point error? Can I massage the data in a way that will help?
Please don't just say "use higher precision" - that's not what I'm after.
EDIT
For information about the data: in general, errors seem to be worse when the operation results in a very large number such as 123456789. Small numbers, such as 1.23456789, seem to yield more accurate results after operations. Am I imagining this, or would scaling larger numbers help accuracy?
Note: this answer starts with a lengthy discussion of the distinction between `a = a - (r * b);` and `float c = r * b; a = a - c;` with a C99-compliant compiler. The part of the question about the goal of improving accuracy while avoiding extended precision is covered at the end.

Extended floating-point precision for intermediate results
If your C99 compiler defines `FLT_EVAL_METHOD` as 0, then the two computations can be expected to produce exactly the same result. If the compiler defines `FLT_EVAL_METHOD` as 1 or 2, then `a = a - (r * b);` will be more precise for some values of `a`, `r` and `b`, because all intermediate computations will be done at an extended precision (`double` for the value 1 and `long double` for the value 2).

The program cannot set `FLT_EVAL_METHOD`, but you can use command-line options to change the way your compiler computes with floating-point, and that will make it change its definition accordingly.
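If you want to see what your own toolchain does, a minimal check (nothing here is specific to your code, it only prints the macro from `<float.h>`) is:

    #include <float.h>
    #include <stdio.h>

    int main(void)
    {
        /* C99 <float.h> defines FLT_EVAL_METHOD:
            0  - evaluate operations in the type of the operands,
            1  - evaluate float and double operations in double,
            2  - evaluate all operations in long double,
           -1  - indeterminable. */
        printf("FLT_EVAL_METHOD = %d\n", (int)FLT_EVAL_METHOD);
        return 0;
    }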
Contraction of some intermediate results

Depending on whether you use the `FP_CONTRACT` pragma (`#pragma STDC FP_CONTRACT`) in your program, and on your compiler's default setting for this pragma, some compound floating-point expressions can be contracted into single instructions that behave as if the intermediate result were computed with infinite precision. This happens to be a possibility for your example when targeting a modern processor, as the fused multiply-add instruction will compute `a` directly and as accurately as allowed by the floating-point type.

However, you should bear in mind that the contraction only takes place at the compiler's option, without any guarantees. The compiler uses the FMA instruction to optimize speed, not accuracy, so the transformation may not take place at lower optimization levels. Sometimes several transformations are possible (e.g. `a * b + c * d` can be computed either as `fmaf(c, d, a*b)` or as `fmaf(a, b, c*d)`), and the compiler may choose one or the other.

In short, the contraction of floating-point computations is not intended to help you achieve accuracy. You might as well make sure it is disabled if you want reproducible results.
However, in the particular case of the fused multiply-add compound operation, you can use the C99 standard function `fmaf()` to tell the compiler to compute the multiplication and addition in a single step with a single rounding. If you do this, the compiler will not be allowed to produce anything other than the best result for `a`.

Note that if the FMA instruction is not available, your compiler's implementation of the function `fmaf()` will at best just use higher precision, and if this happens on your compilation platform, you might just as well use the type `double` for the accumulator: it will be faster and more accurate than using `fmaf()`. In the worst case, a flawed implementation of `fmaf()` will be provided.
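For the expression in the question, a small sketch of this looks as follows: since `fmaf(x, y, z)` computes `x*y + z` with one rounding, `a - r*b` becomes `fmaf(-r, b, a)` (link with `-lm` on most Unix-like platforms):

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        float a = 123.456f, b = 456.789f, r = 0.12345f;

        /* fmaf(x, y, z) computes x*y + z with a single rounding,
           so a - r*b can be written as fmaf(-r, b, a). */
        float plain = a - r * b;       /* two roundings, unless the compiler contracts it */
        float fused = fmaf(-r, b, a);  /* one rounding for the whole expression */

        printf("plain: %.9g\nfused: %.9g\n", plain, fused);
        return 0;
    }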
Improving accuracy while only using single-precision

Use Kahan summation if your computation involves a long chain of additions. Some accuracy can be gained by simply summing the `r*b` terms computed as single-precision products, assuming there are many of them. If you wish to gain more accuracy, you might want to compute `r*b` itself exactly as the sum of two single-precision numbers, but if you do this you might as well switch to double-single arithmetic entirely. Double-single arithmetic would be the same as the double-double technique succinctly described here, but with single-precision numbers instead.
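As an illustration only, here is a sketch comparing a naive single-precision running sum against Kahan summation over a long chain of identical terms (in your case the terms would be the individual `r*b` products); compile it without `-ffast-math`, which would let the compiler simplify the compensation away:

    #include <stdio.h>

    int main(void)
    {
        const int n = 1000000;
        const float term = 0.001f;     /* stand-in for one r*b product */

        float naive = 0.0f;            /* ordinary running sum */
        float sum = 0.0f, c = 0.0f;    /* Kahan accumulator and compensation */

        for (int i = 0; i < n; i++) {
            naive += term;

            float y = term - c;        /* corrected next term */
            float t = sum + y;         /* low-order bits of y may be lost here */
            c = (t - sum) - y;         /* recover what was lost */
            sum = t;
        }

        /* The exact sum is slightly above 1000 (0.001f is not exactly 0.001);
           the naive sum drifts noticeably further from it than the Kahan sum. */
        printf("naive: %.9g\nkahan: %.9g\n", naive, sum);
        return 0;
    }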