With GCC 5.3 the following code compield with -O3 -fma
float mul_add(float a, float b, float c) {
return a*b + c;
}
produces the following assembly
vfmadd132ss %xmm1, %xmm2, %xmm0
ret
I noticed GCC doing this with -O3
already in GCC 4.8.
Clang 3.7 with -O3 -mfma
produces
vmulss %xmm1, %xmm0, %xmm0
vaddss %xmm2, %xmm0, %xmm0
retq
but Clang 3.7 with -Ofast -mfma
produces the same code as GCC with -O3 fast
.
I am surprised that GCC does with -O3
because from this answer it says
The compiler is not allowed to fuse a separated add and multiply unless you allow for a relaxed floating-point model.
This is because an FMA has only one rounding, while an ADD + MUL has two. So the compiler will violate strict IEEE floating-point behaviour by fusing.
However, from this link it says
Regardless of the value of FLT_EVAL_METHOD, any floating-point expression may be contracted, that is, calculated as if all intermediate results have infinite range and precision.
So now I am confused and concerned.
- Is GCC justified in using FMA with
-O3
? - Does fusing violate strict IEEE floating-point behaviour?
- If fusing does violate IEEE floating-point beahviour and since GCC returns
__STDC_IEC_559__
isn't this a contradiction?
Since FMA can be emulated in software it seems to be there should be two compiler switches for FMA: one to tell the compiler to use FMA in calculations and one to tell the compiler that the hardware has FMA.
Apprently this can be controlled with the option -ffp-contract
. With GCC the default is -ffp-contract=fast
and with Clang it's not. Other options such as -ffp-contract=on
and -ffp-contract=off
do no produce the FMA instruction.
For example Clang 3.7 with -O3 -mfma -ffp-contract=fast
produces vfmadd132ss
.
I checked some permutations of #pragma STDC FP_CONTRACT
set to ON
and OFF
with -ffp-contract
set to on
, off
, and fast
. IN all cases I also used -O3 -mfma
.
With GCC the answer is simple. #pragma STDC FP_CONTRACT
ON or OFF makes no difference. Only -ffp-contract
matters.
GCC it uses fma
with
-ffp-contract=fast
(default).
With Clang it uses fma
- with
-ffp-contract=fast
. - with
-ffp-contract=on
(default) and#pragma STDC FP_CONTRACT ON
(default isOFF
).
In other words with Clang you can get fma
with #pragma STDC FP_CONTRACT ON
(since -ffp-contract=on
is the default) or with -ffp-contract=fast
. -ffast-math
(and hence -Ofast
) set -ffp-contract=fast
.
I looked into MSVC and ICC.
With MSVC it uses the fma instruction with /O2 /arch:AVX2 /fp:fast
. With MSVC /fp:precise
is the default.
With ICC it uses fma with -O3 -march=core-avx2
(acctually -O1
is sufficient). This is because by default ICC uses -fp-model fast
. But ICC uses fma even with -fp-model precise
. To disable fma with ICC use -fp-model strict
or -no-fma
.
So by default GCC and ICC use fma when fma is enabled (with -mfma
for GCC/Clang or -march=core-avx2
with ICC) but Clang and MSVC do not.