I was just playing around with recursive functions in C++
and Fortran
and I realised that a simple recursive function in Fortran
is almost twice as fast as its equivalent C++
function. Now, before getting into this, I know there are similar questions here, specifically:
- Why does adding assembly comments cause such radical change in generated code?
- Working of asm volatile (“” : : : “memory”)
- Equivalent to asm volatile in gfortran
However, I am a bit more specific and puzzled as the Fortran compiler seems to be doing what you can achieve with asm volatile
in gcc
. To give you some context let's consider the following recursive Fibonacci number
implementation:
Fortran Code:
module test
implicit none
private
public fib
contains
! Fibonacci function
integer recursive function fib(n) result(r)
integer, intent(in) :: n
if (n < 2) then
r = n
else
r = fib(n-1) + fib(n-2)
end if
end function ! end of Fibonacci function
end module
program fibonacci
use test, only: fib
implicit none
integer :: r,i
integer :: n = 1e09
real(8) :: start, finish, cum_time
cum_time=0
do i= 1,n
call cpu_time(start)
r = fib(20)
call cpu_time(finish)
cum_time = cum_time + (finish - start)
if (cum_time >0.5) exit
enddo
print*,i,'runs, average elapsed time is', cum_time/i/1e-06, 'us'
end program
Compiled with:
gfortran -O3 -march=native
C++ Code:
#include <iostream>
#include <chrono>
using namespace std;
// Fib function
int fib(const int n)
{
int r;
if (n < 2)
r = n;
else
r = fib(n-1) + fib(n-2);
return r;
} // end of fib
template<typename T, typename ... Args>
double timeit(T (*func)(Args...), Args...args)
{
double counter = 1.0;
double mean_time = 0.0;
for (auto iter=0; iter<1e09; ++iter){
std::chrono::time_point<std::chrono::system_clock> start, end;
start = std::chrono::system_clock::now();
func(args...);
end = std::chrono::system_clock::now();
std::chrono::duration<double> elapsed_seconds = end-start;
mean_time += elapsed_seconds.count();
counter++;
if (mean_time > 0.5){
mean_time /= counter;
std::cout << static_cast<long int>(counter)
<< " runs, average elapsed time is "
<< mean_time/1.0e-06 << " \xC2\xB5s" << std::endl;
break;
}
}
return mean_time;
}
int main(){
timeit(fib,20);
return 0;
}
Compiled with:
g++ -O3 -march=native
Timing:
Fortran: 24991 runs, average elapsed time is 20.087 us
C++ : 12355 runs, average elapsed time is 40.471 µs
So gfortran
is twice as fast there compared to gcc
. Looking at the assembly code, I get
Assembly (Fortran):
.L28:
cmpl $1, %r13d
jle .L29
leal -8(%rbx), %eax
movl %ecx, 12(%rsp)
movl %eax, 48(%rsp)
leaq 48(%rsp), %rdi
leal -9(%rbx), %eax
movl %eax, 16(%rsp)
call __bench_MOD_fib
leaq 16(%rsp), %rdi
movl %eax, %r13d
call __bench_MOD_fib
movl 12(%rsp), %ecx
addl %eax, %r13d
Assembly (C++):
.L28:
movl 72(%rsp), %edx
cmpl $1, %edx
movl %edx, %eax
jle .L33
subl $3, %eax
movl $0, 52(%rsp)
movl %eax, %esi
movl %eax, 96(%rsp)
movl 92(%rsp), %eax
shrl %eax
movl %eax, 128(%rsp)
addl %eax, %eax
subl %eax, %esi
movl %edx, %eax
subl $1, %eax
movl %esi, 124(%rsp)
movl %eax, 76(%rsp)
Both assembly codes are made up of almost similar blocks/labels repeated over and over again. As you can see the Fortran assembly makes two calls to fib
function whereas in the C++ assembly, gcc
has probably unrolled all the recursive calls which probably requires more stack push/pop
and tail jumps.
Now if I just put one inline assembly comment in the C++ code like so
Modified C++ Code:
// Fib function
int fib(const int n)
{
int r;
if (n < 2)
r = n;
else
r = fib(n-1) + fib(n-2);
asm("");
return r;
} // end of fib
The generated assembly code, changes to
Assembly (C++ Modified):
.L7:
cmpl $1, %edx
jle .L17
leal -4(%rbx), %r13d
leal -5(%rbx), %edx
cmpl $1, %r13d
jle .L19
leal -5(%rbx), %r14d
cmpl $1, %r14d
jle .L55
leal -6(%rbx), %r13d
movl %r13d, %edi
call _Z3fibi
leal -7(%rbx), %edi
movl %eax, %r15d
call _Z3fibi
movl %r13d, %edi
addl %eax, %r15d
You can now see two calls to fib
function. Timing them gives me
Timing:
Fortran: 24991 runs, average elapsed time is 20.087 us
C++ : 25757 runs, average elapsed time is 19.412 µs
I know the effect of asm
with no output and asm volatile
is to suppress aggressive compiler optimisations but in this case, gcc
thought it was too smart but ended up generating a less efficient code in the first place.
So the question is:
- Why can
gcc
not see this "optimisation", whengfortan
clearly can? - The inline assembly line has to be before the return statement. Put it elsewhere and it will have no effect. Why?
- Is this behaviour compiler specific? For instance can you mimick the same behaviour with clang/MSVC?
- Are there safer ways to make recursions faster in
C
orC++
(without relying on inline assembly or iteration-style coding)? Variadic templates perhaps?
UPDATE:
- The result shown above are all with
gcc 4.8.4
. I have also tried compiling it withgcc 4.9.2
andgcc 5.2
and I get identical results. - The problem can also be replicated (fixed?) if instead of putting
asm
I declare the input argument as volatile i.e.(volatile int n)
instead of(const int n)
, although this will result in a tad bit slower run-time on my machine. - As Michael Karcher has mentioned, we can instead pass the
-fno-optimize-sibling-calls
flag to fix this problem. Since this flag is activated at-O2
level and beyond, even compiling with-O1
fixes this problem. - I have run the same example with
clang 3.5.1
with-O3 -march=native
and though the situation is not identical,clang
also appears to generate faster code withasm
.
Clang Timing:
clang++ w/o asm : 8846 runs, average elapsed time is 56.4555 µs
clang++ with asm : 10427 runs, average elapsed time is 47.8991 µs