With GCC 5.3 the following code compiled with -O3 -mfma
float mul_add(float a, float b, float c) {
    return a*b + c;
}
produces the following assembly
vfmadd132ss %xmm1, %xmm2, %xmm0
ret
I noticed GCC already doing this with -O3 in GCC 4.8.
Clang 3.7 with -O3 -mfma produces
vmulss %xmm1, %xmm0, %xmm0
vaddss %xmm2, %xmm0, %xmm0
retq
but Clang 3.7 with -Ofast -mfma produces the same code as GCC with -O3 -mfma.
I am surprised that GCC does this with -O3, because this answer says
The compiler is not allowed to fuse a separated add and multiply unless you allow for a relaxed floating-point model.
This is because an FMA has only one rounding, while an ADD + MUL has two. So the compiler will violate strict IEEE floating-point behaviour by fusing.
However, this link says
Regardless of the value of FLT_EVAL_METHOD, any floating-point expression may be contracted, that is, calculated as if all intermediate results have infinite range and precision.
So now I am confused and concerned.
- Is GCC justified in using FMA with -O3?
- Does fusing violate strict IEEE floating-point behaviour?
- If fusing does violate IEEE floating-point behaviour, and since GCC defines __STDC_IEC_559__, isn't this a contradiction?
Since FMA can be emulated in software, it seems to me there should be two compiler switches for FMA: one to tell the compiler to use FMA in calculations and one to tell the compiler that the hardware has FMA.
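The two questions ("does the hardware have FMA?" and "does the implementation claim IEEE conformance?") can at least be inspected from inside a program through predefined macros. This is a rough sketch with hypothetical function names. It assumes that __FMA__ is predefined under -mfma, which holds for GCC and Clang but may not for other compilers, and that FP_FAST_FMA is an optional macro from <math.h> which an implementation is free not to define. Neither macro tells you whether the compiler will actually contract a*b + c.

```c
#include <math.h>

/* 1 if the implementation claims IEC 60559 (IEEE 754) conformance. */
int claims_iec_559(void) {
#ifdef __STDC_IEC_559__
    return 1;
#else
    return 0;
#endif
}

/* 1 if this translation unit was compiled with FMA instructions enabled.
   Assumption: __FMA__ is the GCC/Clang macro set by -mfma. */
int target_has_fma(void) {
#ifdef __FMA__
    return 1;
#else
    return 0;
#endif
}

/* 1 if <math.h> reports that fma() is generally about as fast as
   a separate multiply and add. */
int fma_is_fast(void) {
#ifdef FP_FAST_FMA
    return 1;
#else
    return 0;
#endif
}
```

Compiling the same file with and without -mfma and comparing the return value of target_has_fma() shows whether the hardware switch took effect, independently of the contraction setting.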
Apparently this can be controlled with the option -ffp-contract. With GCC the default is -ffp-contract=fast and with Clang it's not. Other options such as -ffp-contract=on and -ffp-contract=off do not produce the FMA instruction.
For example, Clang 3.7 with -O3 -mfma -ffp-contract=fast produces vfmadd132ss.
I checked some permutations of #pragma STDC FP_CONTRACT set to ON and OFF with -ffp-contract set to on, off, and fast. In all cases I also used -O3 -mfma.
With GCC the answer is simple: #pragma STDC FP_CONTRACT ON or OFF makes no difference. Only -ffp-contract matters.
GCC uses fma with -ffp-contract=fast (default).
Clang uses fma
- with -ffp-contract=fast.
- with -ffp-contract=on (default) and #pragma STDC FP_CONTRACT ON (default is OFF).
In other words, with Clang you can get fma with #pragma STDC FP_CONTRACT ON (since -ffp-contract=on is the default) or with -ffp-contract=fast. -ffast-math (and hence -Ofast) sets -ffp-contract=fast.
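For reference, this is roughly what the pragma variants I tested look like. It is only a sketch: the function names are hypothetical, and whether either function ends up with an FMA instruction depends on the compiler and on -ffp-contract as described above. A compiler that does not implement the pragma may warn that it is being ignored.

```c
/* Contraction explicitly allowed: a compiler that honours the pragma
   may emit a single FMA for the expression below. */
float mul_add_contract_on(float a, float b, float c) {
#pragma STDC FP_CONTRACT ON
    return a * b + c;
}

/* Contraction explicitly forbidden: a compiler that honours the pragma
   must round the product before performing the addition. */
float mul_add_contract_off(float a, float b, float c) {
#pragma STDC FP_CONTRACT OFF
    return a * b + c;
}
```

Compiling this with -O3 -mfma -S and checking each function for vfmadd versus vmulss/vaddss is how the combinations above can be compared.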
I looked into MSVC and ICC.
MSVC uses the fma instruction with /O2 /arch:AVX2 /fp:fast. With MSVC, /fp:precise is the default.
ICC uses fma with -O3 -march=core-avx2 (actually -O1 is sufficient). This is because by default ICC uses -fp-model fast. But ICC uses fma even with -fp-model precise. To disable fma with ICC use -fp-model strict or -no-fma.
So by default GCC and ICC use fma when fma is enabled (with -mfma for GCC/Clang or -march=core-avx2 with ICC) but Clang and MSVC do not.
It doesn't violate IEEE-754, because IEEE-754 defers to languages on this point:
In standard C, the STDC FP_CONTRACT pragma provides the means to control this value-changing optimization. So GCC is licensed to perform the fusion by default, so long as it allows you to disable the optimization by setting STDC FP_CONTRACT OFF. Not supporting that means not adhering to the C standard.
When you quoted that fused multiply-add is allowed, you left out the important condition "unless pragma FP_CONTRACT is off". This is a newish feature in C (I think introduced in C99) and was made absolutely necessary by PowerPC processors, which all had fused multiply-add from the start. In fact, x*y was equivalent to fma (x, y, 0) and x+y was equivalent to fma (1.0, x, y).
FP_CONTRACT is what controls fused multiply/add, not FLT_EVAL_METHOD. Although if FLT_EVAL_METHOD allows higher precision, then contracting is always legal; just pretend that the operations were performed with very high precision and then rounded.
The fma function is useful if what you want is the precision rather than the speed. It will calculate the contracted result slowly but correctly even if it isn't available in hardware, and it should be inlined if it is available in hardware.