Consider this simple code:
#include <complex.h>
complex float f(complex float x) {
return x*x;
}
If you compile it with -O3 -march=core-avx2 -fp-model strict
using the Intel Compiler you get:
f:
vmovsldup xmm1, xmm0 #3.12
vmovshdup xmm2, xmm0 #3.12
vshufps xmm3, xmm0, xmm0, 177 #3.12
vmulps xmm4, xmm1, xmm0 #3.12
vmulps xmm5, xmm2, xmm3 #3.12
vaddsubps xmm0, xmm4, xmm5 #3.12
ret
This is much simpler code than you get from both gcc
and clang
and also much simpler than the code you will find online for multiplying complex numbers. It doesn't, for example appear explicitly to deal with complex NaN or infinities.
Does this assembly meet the specs for C99 complex multiplication?
The code is non-conforming.
Annex G, Section 5.1, Paragraph 4 reads
The *
and /
operators satisfy the following infinity properties for all real, imaginary, and complex operands:
— if one operand is an infinity and the other operand is a nonzero finite number or an infinity, then the result of the * operator is an infinity;
So if z = a * ib is infinite and w = c * id is infinite, the number z * w must be infinite.
The same annex, Section 3, Paragraph 1 defines what it means for a complex number to be infinite:
A complex or imaginary value with at least one infinite part is regarded as an infinity (even if its other part is a NaN).
So z is infinite if either a or b are.
This is indeed a sensible choice as it reflects the mathematical framework1.
However if we let z = ∞ + i∞ (an infinite value) and w = i∞ (and infinite value) the result for the Intel code is z * w = NaN + iNaN due to the ∞ · 0 intermediates2.
This suffices to label it as non-conforming.
We can further confirm this by taking a look at the footnote on the first quote (the footnote was not reported here), it mentions the CX_LIMITED_RANGE
pragma directive.
Section 7.3.4, Paragraph 1 reads
The usual mathematical formulas for complex multiply, divide, and absolute value are problematic because of their treatment of infinities and because of undue overflow and underflow. The CX_LIMITED_RANGE
pragma can be used to inform the implementation that (where the state is ‘‘on’’) the usual mathematical formulas [that produces NaNs] are acceptable.
Here the standard committee is trying to alleviate the huge mole of work for the complex multiplication (and division).
In fact GCC has a flag to control this behaviour:
-fcx-limited-range
When enabled, this option states that a range reduction step is not needed when performing complex division.
Also, there is no checking whether the result of a complex multiplication or division is NaN + I*NaN, with an attempt to rescue the situation in that case.
The default is -fno-cx-limited-range
, but is enabled by -ffast-math
.
This option controls the default setting of the ISO C99 CX_LIMITED_RANGE
pragma.
It this option alone that makes GCC generate slow code and additional checks, without it the code it generate has the same flaws of Intel's one (I translated the source to C++)
f(std::complex<float>):
movq QWORD PTR [rsp-8], xmm0
movss xmm0, DWORD PTR [rsp-8]
movss xmm2, DWORD PTR [rsp-4]
movaps xmm1, xmm0
movaps xmm3, xmm2
mulss xmm1, xmm0
mulss xmm3, xmm2
mulss xmm0, xmm2
subss xmm1, xmm3
addss xmm0, xmm0
movss DWORD PTR [rsp-16], xmm1
movss DWORD PTR [rsp-12], xmm0
movq xmm0, QWORD PTR [rsp-16]
ret
Without it the code is
f(std::complex<float>):
sub rsp, 40
movq QWORD PTR [rsp+24], xmm0
movss xmm3, DWORD PTR [rsp+28]
movss xmm2, DWORD PTR [rsp+24]
movaps xmm1, xmm3
movaps xmm0, xmm2
call __mulsc3
movq QWORD PTR [rsp+16], xmm0
movss xmm0, DWORD PTR [rsp+16]
movss DWORD PTR [rsp+8], xmm0
movss xmm0, DWORD PTR [rsp+20]
movss DWORD PTR [rsp+12], xmm0
movq xmm0, QWORD PTR [rsp+8]
add rsp, 40
ret
and the __mulsc3
function is practically the same the standard C99 recommends for complex multiplication.
It includes the above mentioned checks.
1 Where the modulus of a number is extended from the real case |z| to the complex one ‖z‖, keeping the definition of infinite as the result of unbounded limits. Simply put, in the complex plane there is a whole circumference of infinite values and it takes just one "coordinate" to be infinite to get an infinite modulus.
2 The situation get worst if we remember that z = NaN + i∞ or z = ∞ + iNaN are valid infinite values
I get similar, but not identical, code from clang 3.8 at -O2 -march=core-avx2 -ffast-math
: I'm not deeply familiar with latter-day x86 floating point features, but I think it's doing the same calculation but using different instructions to shuffle values around in registers.
f:
vmovshdup %xmm0, %xmm1 # xmm1 = xmm0[1,1,3,3]
vaddss %xmm0, %xmm0, %xmm2
vmulss %xmm2, %xmm1, %xmm2
vmulss %xmm1, %xmm1, %xmm1
vfmsub231ss %xmm0, %xmm0, %xmm1
vinsertps $16, %xmm2, %xmm1, %xmm0 # xmm0 = xmm1[0],xmm2[0],xmm1[2,3]
retq
GCC 6.3, with the same options, again appears to do the same calculation but shuffle values around in yet a third way:
f:
vmovq %xmm0, -8(%rsp)
vmovss -4(%rsp), %xmm2
vmovss -8(%rsp), %xmm0
vmulss %xmm2, %xmm2, %xmm1
vfmsub231ss %xmm0, %xmm0, %xmm1
vmulss %xmm2, %xmm0, %xmm0
vmovss %xmm1, -16(%rsp)
vaddss %xmm0, %xmm0, %xmm0
vmovss %xmm0, -12(%rsp)
vmovq -16(%rsp), %xmm0
ret
Without -ffast-math
, both compilers generate substantially different code that does appear to check for NaN, at least.
I conclude from this that Intel's compiler is not generating a fully IEEE-compliant complex multiplication even with -fp-model strict
. There may be some other command-line switch that makes it generate fully IEEE-compliant code.
Whether or not this qualifies as a violation of C99 depends on whether Intel's compiler is documented to conform to Annex F and G (which specify what it means for an implementation of C to provide IEEE-compliant real and complex arithmetic), and if it is, what command-line options you have to give to get the conforming mode.