Is gcc's asm volatile equivalent to the gfortr

I was just playing around with recursive functions in C++ and Fortran and I realised that a simple recursive function in Fortran is almost twice as fast as its equivalent C++ function. Now, before getting into this, I know there are similar questions here, specifically:

Why does adding assembly comments cause such radical change in generated code?
Working of asm volatile (“” : : : “memory”)
Equivalent to asm volatile in gfortran

However, I am a bit more specific and puzzled as the Fortran compiler seems to be doing what you can achieve with asm volatile in gcc. To give you some context let's consider the following recursive Fibonacci number implementation:

Fortran Code:

module test
implicit none
private
public fib

contains

! Fibonacci function
integer recursive function fib(n) result(r)
    integer, intent(in) :: n
    if (n < 2) then
        r = n
    else
        r = fib(n-1) + fib(n-2)
    end if
end function  ! end of Fibonacci function
end module

program fibonacci
use test, only: fib
implicit none
integer :: r,i 
integer :: n = 1e09
real(8) :: start, finish, cum_time

cum_time=0
do i= 1,n 
    call cpu_time(start)
    r = fib(20)
    call cpu_time(finish) 
    cum_time = cum_time + (finish - start)
    if (cum_time >0.5) exit
enddo  

print*,i,'runs, average elapsed time is', cum_time/i/1e-06, 'us' 
end program

Compiled with:

gfortran -O3 -march=native

C++ Code:

#include <iostream>
#include <chrono>
using namespace std;

// Fib function
int fib(const int n)
{
    int r;
    if (n < 2)
        r = n;
    else
        r = fib(n-1) + fib(n-2);
    return r;
} // end of fib

template<typename T, typename ... Args>
double timeit(T (*func)(Args...), Args...args)
{
    double counter = 1.0;
    double mean_time = 0.0;
    for (auto iter=0; iter<1e09; ++iter){
        std::chrono::time_point<std::chrono::system_clock> start, end;
        start = std::chrono::system_clock::now();

        func(args...);

        end = std::chrono::system_clock::now();
        std::chrono::duration<double> elapsed_seconds = end-start;

        mean_time += elapsed_seconds.count();
        counter++;

        if (mean_time > 0.5){
            mean_time /= counter;
            std::cout << static_cast<long int>(counter)
            << " runs, average elapsed time is "
            << mean_time/1.0e-06 << " \xC2\xB5s" << std::endl; 
            break;
        }
    }
    return mean_time;
}

int main(){
    timeit(fib,20);
    return 0;
}

Compiled with:

g++ -O3 -march=native

Timing:

Fortran: 24991 runs, average elapsed time is 20.087 us
C++    : 12355 runs, average elapsed time is 40.471 µs

So gfortran is twice as fast there compared to gcc. Looking at the assembly code, I get

Assembly (Fortran):

.L28:
    cmpl    $1, %r13d
    jle .L29
    leal    -8(%rbx), %eax
    movl    %ecx, 12(%rsp)
    movl    %eax, 48(%rsp)
    leaq    48(%rsp), %rdi
    leal    -9(%rbx), %eax
    movl    %eax, 16(%rsp)
    call    __bench_MOD_fib
    leaq    16(%rsp), %rdi
    movl    %eax, %r13d
    call    __bench_MOD_fib
    movl    12(%rsp), %ecx
    addl    %eax, %r13d

Assembly (C++):

.L28:
    movl    72(%rsp), %edx
    cmpl    $1, %edx
    movl    %edx, %eax
    jle .L33
    subl    $3, %eax
    movl    $0, 52(%rsp)
    movl    %eax, %esi
    movl    %eax, 96(%rsp)
    movl    92(%rsp), %eax
    shrl    %eax
    movl    %eax, 128(%rsp)
    addl    %eax, %eax
    subl    %eax, %esi
    movl    %edx, %eax
    subl    $1, %eax
    movl    %esi, 124(%rsp)
    movl    %eax, 76(%rsp)

Both assembly codes are made up of almost similar blocks/labels repeated over and over again. As you can see the Fortran assembly makes two calls to fib function whereas in the C++ assembly, gcc has probably unrolled all the recursive calls which probably requires more stack push/pop and tail jumps.

Now if I just put one inline assembly comment in the C++ code like so

Modified C++ Code:

// Fib function
int fib(const int n)
{
    int r;
    if (n < 2)
        r = n;
    else
        r = fib(n-1) + fib(n-2);
    asm("");
    return r;
} // end of fib

The generated assembly code, changes to

Assembly (C++ Modified):

.L7:
    cmpl    $1, %edx
    jle .L17
    leal    -4(%rbx), %r13d
    leal    -5(%rbx), %edx
    cmpl    $1, %r13d
    jle .L19
    leal    -5(%rbx), %r14d
    cmpl    $1, %r14d
    jle .L55
    leal    -6(%rbx), %r13d
    movl    %r13d, %edi
    call    _Z3fibi
    leal    -7(%rbx), %edi
    movl    %eax, %r15d
    call    _Z3fibi
    movl    %r13d, %edi
    addl    %eax, %r15d

You can now see two calls to fib function. Timing them gives me

Timing:

Fortran: 24991 runs, average elapsed time is 20.087 us
C++    : 25757 runs, average elapsed time is 19.412 µs

I know the effect of asm with no output and asm volatile is to suppress aggressive compiler optimisations but in this case, gcc thought it was too smart but ended up generating a less efficient code in the first place.

So the question is:

Why can gcc not see this "optimisation", when gfortan clearly can?
The inline assembly line has to be before the return statement. Put it elsewhere and it will have no effect. Why?
Is this behaviour compiler specific? For instance can you mimick the same behaviour with clang/MSVC?
Are there safer ways to make recursions faster in C or C++ (without relying on inline assembly or iteration-style coding)? Variadic templates perhaps?

UPDATE:

The result shown above are all with gcc 4.8.4. I have also tried compiling it with gcc 4.9.2 and gcc 5.2 and I get identical results.
The problem can also be replicated (fixed?) if instead of putting asm I declare the input argument as volatile i.e. (volatile int n) instead of (const int n), although this will result in a tad bit slower run-time on my machine.
As Michael Karcher has mentioned, we can instead pass the -fno-optimize-sibling-calls flag to fix this problem. Since this flag is activated at -O2 level and beyond, even compiling with -O1 fixes this problem.
I have run the same example with clang 3.5.1 with -O3 -march=native and though the situation is not identical, clang also appears to generate faster code with asm.

Clang Timing:

clang++ w/o asm    :  8846 runs, average elapsed time is 56.4555 µs
clang++ with asm   : 10427 runs, average elapsed time is 47.8991 µs

See the bold print near the end of this answer on how to get a fast program generated by gcc. Read the answer for replies to the four questions.

Your first question assumes that gfortran is able to see an optimization possibility that gcc failed to see. In fact, the opposite is true. gcc identified something that it considered to be an optimization possibility, while gfortran missed it. Alas, gcc was wrong and the optimization it applied turns out to be a 100% speed loss on your system (comparable on mine).

To address your second question: The asm statement prevented an internal transformation that made gcc see the false optimization possiblity. Without the asm statement, your code got (validly) transformed to:

int fib(const int n)
{
    if (n < 2)
        return n;
    else
        return fib(n-1) + fib(n-2);
}

The return statement containing the recursive call triggers the "sibling calls optimization" that pessimizes your code. Including the asm statement prevents moving the return instruction across it.

Currently, I only have gcc at hand, so I can't try the behaviour of other compilers to answer your third question by evidence, but this seems definitely compiler dependent. You ran into a quirk (or bug, however you call it) of gcc that it generates bad code while trying to optimize it. The optimizers of different compilers are highly different, so it is quite likely that some other compiler doesn't mis-optimize your code like gcc does. On the other hand, code transformations for optimization is a well-researched topic, and most compilers are implementing similar approaches to optimization, so it is possible that another compiler steps into the same trap as gcc.

To address you last question: It is not a problem about C/C++ versus Fortan, but a problem about gcc that messes up this example program (and likely similar production programs). So there is no way to make recursion faster in C++, but there is a way to speed up this example in gcc, by disabling the problematic optimization: -fno-optimize-sibling-calls, which results (on my system, in a single test run) in even faster code than just inserting the asm statement.