It is great that gcc compiler 4.8 comes with AVX optimization with -Ofast option. However, I found an interesting but stupid bug, that it adds additional computations which are unnecessary. Maybe I am wrong so can someone give me an explanation?
The original C++ source code is as follows:
#define N 1000007
float a[N],b[N],c[N],d[N],e[N];
int main(int argc, char *argv[]){
cout << a << ' ' << b << ' ' << c << endl;
for(int x=0; x<N; ++x){
c[x] = 1/sqrt((a[x]+b[x]-c[x])*d[x]/e[x]);
}
return 0;
}
The code is compiled using g++ 4.8.4 in Ubuntu 14.04.3 x86_64: g++ -mavx avx.cpp -masm=intel -c -g -Wa,-ahl=avx.asm -Ofast
The assembly source code is as follows:
90 .LVL10:
91 006b C5FC2825 vmovaps ymm4, YMMWORD PTR .LC0[rip]
91 00000000
92 0073 31C0 xor eax, eax
93 0075 C5FC281D vmovaps ymm3, YMMWORD PTR .LC1[rip]
25:avx.cpp **** for(int x=0; x<N; ++x){
26:avx.cpp **** c[x] = 1/sqrt((a[x]+b[x]-c[x])*d[x]/e[x]);
101 .loc 1 26 0 discriminator 2
102 0080 C5FC2890 vmovaps ymm2, YMMWORD PTR b[rax]
102 00000000
103 0088 4883C020 add rax, 32
104 008c C5FC2888 vmovaps ymm1, YMMWORD PTR e[rax-32]
104 00000000
105 0094 C5EC5890 vaddps ymm2, ymm2, YMMWORD PTR a[rax-32]
105 00000000
106 009c C5FC53C1 vrcpps ymm0, ymm1
107 00a0 C5FC59C9 vmulps ymm1, ymm0, ymm1
108 00a4 C5FC59C9 vmulps ymm1, ymm0, ymm1
109 00a8 C5EC5C90 vsubps ymm2, ymm2, YMMWORD PTR c[rax-32]
109 00000000
110 00b0 C5FC58C0 vaddps ymm0, ymm0, ymm0
111 00b4 C5EC5990 vmulps ymm2, ymm2, YMMWORD PTR d[rax-32]
111 00000000
112 00bc C5FC5CC9 vsubps ymm1, ymm0, ymm1
113 00c0 C5EC59C1 vmulps ymm0, ymm2, ymm1
118 00c4 C5FC52C8 vrsqrtps ymm1, ymm0
119 00c8 C5F459C0 vmulps ymm0, ymm1, ymm0
120 00cc C5FC59C1 vmulps ymm0, ymm0, ymm1
121 00d0 C5F459CB vmulps ymm1, ymm1, ymm3
122 00d4 C5FC58C4 vaddps ymm0, ymm0, ymm4
^LGAS LISTING /tmp/ccJtIFtg.s page 21
123 00d8 C5FC59C9 vmulps ymm1, ymm0, ymm1
124 .LBE45:
125 .LBE44:
126 .loc 1 26 0 discriminator 2
127 00dc C5FC2988 vmovaps YMMWORD PTR c[rax-32], ymm1
127 00000000
128 00e4 483D0009 cmp rax, 4000000
128 3D00
129 00ea 7594 jne .L3
Now look at line 106, 107, 108, 110, 112 and 113.
The compiler computes the division by e[x] using the multiplication by its inverse. So Line 106 computes 1/e[x], which is correct. After that it can directly multiply this with the final product of (a[x]+b[x]-c[x])*d[x], which is stored in ymm2, Line 111. However, instead of doing this, the compiler did something interesting and ridiculous:
it first multiplies the computed reciprocal 1/e[x] with e[x] to obtain 1 (Line 107)
then multiply this 1 with 1/e[x] to obtain back 1/e[x] (Line 108)
then it adds 1/e[x] to itself to obtain 2/e[x] (Line 110)
then it subtracts 2/e[x] by 1/e[x] to obtain back 1/e[x] (Line 112)
After that, the compiler is ingenious to use the vrsqrtps instruction to compute 1/sqrt(). However, after that, what happens? Instead of extracting the output in ymm1 (Line 118), it did something even more fanciful again:
it first multiplies 1/sqrt(x) by x to obtain sqrt(x), (Line 119)
it then multiplies the sqrt(x) by 1/sqrt(x) to obtain back 1, (Line 120)
it then multiplies 1/sqrt(x) by 1 (pre-stored in ymm3) to obtain the same 1/sqrt(x), (Line 121)
it then adds the obtained 1 by 0 (pre-stored in ymm4) to obtain 1, (Line 122)
it then multiplies 1/sqrt(x) with the obtained 1 to obtain back the same 1/sqrt(x), (Line 123)
The above two redundancies show that whenever 1/x is required, the compiler tends to multiply the already obtained output with the original number to obtain back 1, and then multiply this 1 with the already obtained output to obtain back the same output. Is there any reason for doing this? Or it is just a bug?