Fused multiply add and default rounding modes

2019-01-09 02:07发布

问题:

With GCC 5.3 the following code compield with -O3 -fma

float mul_add(float a, float b, float c) {
  return a*b + c;
}

produces the following assembly

vfmadd132ss     %xmm1, %xmm2, %xmm0
ret

I noticed GCC doing this with -O3 already in GCC 4.8.

Clang 3.7 with -O3 -mfma produces

vmulss  %xmm1, %xmm0, %xmm0
vaddss  %xmm2, %xmm0, %xmm0
retq

but Clang 3.7 with -Ofast -mfma produces the same code as GCC with -O3 fast.

I am surprised that GCC does with -O3 because from this answer it says

The compiler is not allowed to fuse a separated add and multiply unless you allow for a relaxed floating-point model.

This is because an FMA has only one rounding, while an ADD + MUL has two. So the compiler will violate strict IEEE floating-point behaviour by fusing.

However, from this link it says

Regardless of the value of FLT_EVAL_METHOD, any floating-point expression may be contracted, that is, calculated as if all intermediate results have infinite range and precision.

So now I am confused and concerned.

  1. Is GCC justified in using FMA with -O3?
  2. Does fusing violate strict IEEE floating-point behaviour?
  3. If fusing does violate IEEE floating-point beahviour and since GCC returns __STDC_IEC_559__ isn't this a contradiction?

Since FMA can be emulated in software it seems to be there should be two compiler switches for FMA: one to tell the compiler to use FMA in calculations and one to tell the compiler that the hardware has FMA.


Apprently this can be controlled with the option -ffp-contract. With GCC the default is -ffp-contract=fast and with Clang it's not. Other options such as -ffp-contract=on and -ffp-contract=off do no produce the FMA instruction.

For example Clang 3.7 with -O3 -mfma -ffp-contract=fast produces vfmadd132ss.


I checked some permutations of #pragma STDC FP_CONTRACT set to ON and OFF with -ffp-contract set to on, off, and fast. IN all cases I also used -O3 -mfma.

With GCC the answer is simple. #pragma STDC FP_CONTRACT ON or OFF makes no difference. Only -ffp-contract matters.

GCC it uses fma with

  1. -ffp-contract=fast (default).

With Clang it uses fma

  1. with -ffp-contract=fast.
  2. with -ffp-contract=on (default) and #pragma STDC FP_CONTRACT ON (default is OFF).

In other words with Clang you can get fma with #pragma STDC FP_CONTRACT ON (since -ffp-contract=on is the default) or with -ffp-contract=fast. -ffast-math (and hence -Ofast) set -ffp-contract=fast.


I looked into MSVC and ICC.

With MSVC it uses the fma instruction with /O2 /arch:AVX2 /fp:fast. With MSVC /fp:precise is the default.

With ICC it uses fma with -O3 -march=core-avx2 (acctually -O1 is sufficient). This is because by default ICC uses -fp-model fast. But ICC uses fma even with -fp-model precise. To disable fma with ICC use -fp-model strict or -no-fma.

So by default GCC and ICC use fma when fma is enabled (with -mfma for GCC/Clang or -march=core-avx2 with ICC) but Clang and MSVC do not.

回答1:

It doesn't violate IEEE-754, because IEEE-754 defers to languages on this point:

A language standard should also define, and require implementations to provide, attributes that allow and disallow value-changing optimizations, separately or collectively, for a block. These optimizations might include, but are not limited to:

...

― Synthesis of a fusedMultiplyAdd operation from a multiplication and an addition.

In standard C, the STDC FP_CONTRACT pragma provides the means to control this value-changing optimization. So GCC is licensed to perform the fusion by default, so long as it allows you to disable the optimization by setting STDC FP_CONTRACT OFF. Not supporting that means not adhering to the C standard.



回答2:

When you quoted that fused multiply-add is allowed, you left out the important condition "unless pragma FP_CONTRACT is off". Which is a newish feature in C (I think introduced in C99) and was made absolutely necessary by PowerPC, which all had fused multiply-add from the start - actually, x*y was equivalent to fma (x, y, 0) and x+y was equivalent to fma (1.0, x, y).

FP_CONTRACT is what controls fused multiply/add, not FLT_EVAL_METHOD. Although if FLT_EVAL_METHOD allows higher precision, then contracting is always legal; just pretend that the operations were performed with very high precision and then rounded.

The fma function is useful if you don't want the speed, but the precision. It will calculate the contracted result slowly but correctly even if it isn't available in hardware. And should be inlined if it is available in hardware.