When I was porting some Fortran code to C, I was surprised to find that most of the execution-time discrepancy between the Fortran program compiled with ifort (the Intel Fortran compiler) and the C program compiled with gcc comes from the evaluation of trigonometric functions (sin, cos). It surprised me because I used to believe what this answer explains: that functions like sine and cosine are implemented in microcode inside microprocessors.
To isolate the problem, I wrote a small test program in Fortran:
program ftest
implicit none
real(8) :: x
integer :: i
x = 0d0
do i = 1, 10000000
x = cos (2d0 * x)
end do
write (*,*) x
end program ftest
On an Intel Q6600 processor running 3.6.9-1-ARCH x86_64 Linux, I get with ifort version 12.1.0
$ ifort -o ftest ftest.f90
$ time ./ftest
-0.211417093282753
real 0m0.280s
user 0m0.273s
sys 0m0.003s
while with gcc version 4.7.2 I get
$ gfortran -o ftest ftest.f90
$ time ./ftest
0.16184945593939115
real 0m2.148s
user 0m2.090s
sys 0m0.003s
This is nearly a factor of 8 difference! Can I still believe that the gcc implementation of cos is a wrapper around the microprocessor implementation, in the same way as is probably done in the Intel implementation? If so, where is the bottleneck?
EDIT
According to the comments, enabling optimizations should improve performance. My assumption was that optimizations do not affect library functions (which does not mean that I don't use them in nontrivial programs). However, here are two additional benchmarks (now on my home computer, an Intel Core 2):
$ gfortran -o ftest ftest.f90
$ time ./ftest
0.16184945593939115
real 0m2.993s
user 0m2.986s
sys 0m0.000s
and
$ gfortran -Ofast -march=native -o ftest ftest.f90
$ time ./ftest
0.16184945593939115
real 0m2.967s
user 0m2.960s
sys 0m0.003s
Which particular optimizations did you (the commenters) have in mind? And how can a compiler exploit a multi-core processor in this particular example, where each iteration depends on the result of the previous one?
EDIT 2
The benchmark tests of Daniel Fisher and Ilmari Karonen made me think that the problem might be related to the particular version of gcc (4.7.2), and maybe to the particular build of it (Arch x86_64 Linux), that I am using on my computers. So I repeated the test on an Intel Core i7 box with Debian x86_64 Linux, gcc version 4.4.5, and ifort version 12.1.0:
$ gfortran -O3 -o ftest ftest.f90
$ time ./ftest
0.16184945593939115
real 0m0.272s
user 0m0.268s
sys 0m0.004s
and
$ ifort -O3 -o ftest ftest.f90
$ time ./ftest
-0.211417093282753
real 0m0.178s
user 0m0.176s
sys 0m0.004s
For me this is a perfectly acceptable performance difference, one that would never have made me ask this question. It seems that I will have to ask about this issue on the Arch Linux forums.
However, an explanation of the whole story is still very welcome.