I had a function:
```c++
float lerp(float alpha, float x0, float x1) {
    return (1.0f - alpha) * x0 + alpha * x1;
}
```
For those who haven't seen it, this is preferable to `x0 + (x1 - x0) * alpha` because the latter doesn't guarantee that `lerp(1.0f, x0, x1) == x1`.
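For instance (a sketch with contrived inputs, assuming round-to-nearest-ties-to-even, chosen so that `x1 - x0` rounds to exactly -1.0f):

```c++
#include <cmath>
#include <cstdio>

int main() {
    float x0 = 1.0f;
    float x1 = std::ldexp(1.0f, -25);             // 2^-25, about 2.98e-8
    float naive = x0 + (x1 - x0) * 1.0f;          // x1 - x0 rounds to -1.0f, so this is 0.0f, not x1
    float good = (1.0f - 1.0f) * x0 + 1.0f * x1;  // exactly x1
    std::printf("naive = %g, good = %g, x1 = %g\n", naive, good, x1);
}
```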
Now, I want my lerp function to have an additional property: I'd like `lerp(alpha, x0, x1) == lerp(1-alpha, x1, x0)`. (As for why: this is a toy example of a more complicated function.) The solution I came up with that seems to work is
```c++
float lerp_symmetric(float alpha, float x0, float x1) {
    float w0 = 1.0f - alpha;
    float w1 = 1.0f - w0;
    return w0 * x0 + w1 * x1;
}
```
This double subtraction has the effect of rounding near zero and near one, so if `alpha = std::nextafter(0.0f, 1.0f)` (1.4012985e-45), then `1 - alpha == 1` and so `1 - (1 - alpha) == 0`. As far as I can tell, it is always true that `1.0f - x == 1.0f - (1.0f - (1.0f - x))`. It also seems to have the effect that `w0 + w1 == 1.0f`.
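A quick sanity check of those observations (a sketch; one could also verify them exhaustively over all `float` values in [0, 1]):

```c++
#include <cassert>
#include <cmath>

int main() {
    float alpha = std::nextafter(0.0f, 1.0f); // smallest positive subnormal float
    float w0 = 1.0f - alpha;                  // rounds to exactly 1.0f
    float w1 = 1.0f - w0;                     // exactly 0.0f
    assert(w0 == 1.0f);
    assert(w1 == 0.0f);
    assert(w0 + w1 == 1.0f);
}
```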
Questions:
- Is this a reasonable approach?
- Can I trust my compiler to do what I want? In particular, I know on Windows it sometimes uses more precision for partial results, and I know the compiler is allowed to do some algebra; obviously `1-(1-x) == x` algebraically.
This is in C++11 using Clang, Visual Studio, and GCC.
If one format of IEEE-754 binary floating-point is used throughout (e.g., basic 32-bit binary, the format commonly used for C++ `float`), with all C++ operators mapped to IEEE-754 operations in the direct and simple way, then `lerp_symmetric(alpha, x0, x1)` (hereafter referred to as `A`) equals `lerp_symmetric(1-alpha, x1, x0)` (`B`).

Proof:
If `alpha`, which we assume is in [0, 1], is greater than or equal to ½, then `1-alpha` is exact by Sterbenz' lemma. (Sterbenz' lemma: if y/2 ≤ x ≤ 2y, then the floating-point subtraction x − y is exact. By "exact," we mean the computed floating-point result equals the mathematical result; there is no rounding error.) Then, in computing `A`, `w0` is exact since it is `1-alpha`, and `w1` is exact since its mathematical result is `alpha`, so it is exactly representable. And, in computing `B`, `w0` is exact since its mathematical result is `alpha`, and `w1` is exact since it is again `1-alpha`.

If `alpha` is less than ½, then `1-alpha` may have some rounding error. Let the result be `beta`. Then, in `A`, `w0` is `beta`. Now ½ ≤ `beta`, so Sterbenz' lemma applies to the evaluation of `w1 = 1.0f - w0`, so `w1` is exact (and equals the mathematical result of `1-beta`). And, in `B`, `w0` is exact, again by Sterbenz' lemma, and equals the `w1` of `A`, and `w1` (of `B`) is exact since its mathematical result is `beta`, which is exactly representable.

Now we can see that `w0` in `A` equals `w1` in `B`, and `w1` in `A` equals `w0` in `B`. Letting `beta` be `1-alpha` in either of the above cases, `A` and `B` therefore return `(1-beta)*x0 + beta*x1` and `beta*x1 + (1-beta)*x0`, respectively. IEEE-754 addition is commutative (except for NaN payloads), so `A` and `B`
return identical results.
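Given those assumptions, the symmetry can also be spot-checked empirically. This is a sketch, not part of the proof: it assumes strict `float` evaluation (no excess precision and no contraction; see the answer to the second question below) and samples random inputs rather than enumerating all of them.

```c++
#include <cassert>
#include <random>

float lerp_symmetric(float alpha, float x0, float x1) {
    float w0 = 1.0f - alpha;
    float w1 = 1.0f - w0;
    return w0 * x0 + w1 * x1;
}

int main() {
    std::mt19937 gen(12345);
    std::uniform_real_distribution<float> alphas(0.0f, 1.0f);
    std::uniform_real_distribution<float> values(-1.0e6f, 1.0e6f);
    for (int i = 0; i < 1000000; ++i) {
        float alpha = alphas(gen);
        float x0 = values(gen);
        float x1 = values(gen);
        // A == B, bit for bit, per the proof above (no NaNs among these inputs).
        assert(lerp_symmetric(alpha, x0, x1) ==
               lerp_symmetric(1.0f - alpha, x1, x0));
    }
}
```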
Answering the questions:

I would say it is a reasonable approach. Without further thought, I would not assert that no improvements could be made.
No, you cannot trust your compiler:

- `w0*x0 + w1*x1` may be evaluated using `double`, `long double`, or another precision even though all operands are `float`.
- `w0*x0 + w1*x1` may be evaluated as `fmaf(w0, x0, w1*x1)` (contraction), thus using exact arithmetic for one of the multiplications but not the other.

You can partially work around this by rewriting the function so that each intermediate result passes through a `float`:
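Something like the following sketch (assuming the intent is simply to force a rounding to `float` at each step):

```c++
float lerp_symmetric(float alpha, float x0, float x1) {
    float w0 = 1.0f - alpha;  // each assignment discards excess precision
    float w1 = 1.0f - w0;
    float p0 = w0 * x0;       // separate float products cannot be contracted into an fma
    float p1 = w1 * x1;
    return p0 + p1;           // the return value is rounded to float as well
}
```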
The C++ standard requires that excess precision be discarded in assignments and casts. This extends to function returns. (I report this and other C++ specifications from memory; the standard should be checked.) So each of the above will round its result to `float` even if extra precision was initially used. This will prevent contraction.

(One should also be able to disable contraction by including `<cmath>` and inserting the preprocessor directive `#pragma STDC FP_CONTRACT OFF`. Some compilers might not support that.)
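Where supported, that variant might look like the following sketch (whether the pragma is honored is compiler-specific, as noted above):

```c++
#include <cmath>

#pragma STDC FP_CONTRACT OFF  // request: do not contract a*b + c into fma(a, b, c)

float lerp_symmetric(float alpha, float x0, float x1) {
    float w0 = 1.0f - alpha;
    float w1 = 1.0f - w0;
    return w0 * x0 + w1 * x1;
}
```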
One problem with the workaround above is that values are first rounded to the evaluation precision and then rounded to `float`. There are mathematical values x for which rounding x first to `double` (or another precision) and then to `float` produces a different result than rounding x directly to `float` (a double-rounding error). The dissertation *A Rigorous Framework for Fully Supporting the IEEE Standard for Floating-Point Arithmetic in High-Level Programming Languages* by Samuel A. Figueroa del Cid establishes that evaluating a single operation of multiplication or addition in IEEE-754 basic 64-bit binary floating-point (commonly used for `double`) and then rounding to the 32-bit format never has a double-rounding error (because these operations, given inputs that are elements of the 32-bit format, can never produce one of the troublesome x values described above).¹

If I am correct about the C++ specifications reported from memory, then the workaround described above should be complete as long as the C++ implementation evaluates floating-point expressions either in the nominal format or in a format sufficiently wider to satisfy the requirements Figueroa del Cid gives.
Footnote

¹ Per Figueroa del Cid, if `x` and `y` have p-bit significands, and `x+y` or `x*y` is computed exactly and then rounded to q places, a second rounding to p places will have the same answer as if the result were directly rounded to p places if p ≤ (q − 1)/2. This is satisfied for IEEE-754 basic 32-bit binary floating-point (p = 24) evaluated in the 64-bit format (q = 53), since 24 ≤ (53 − 1)/2 = 26. These formats are commonly used for `float` and `double`, and the workaround above should suffice in a C++ implementation that uses them. If a C++ implementation evaluated `float` using a precision that did not satisfy the condition Figueroa del Cid gives, then double-rounding errors could occur.