I wrote a program that calculates a Fibonacci number at compile time (constexpr) using the template metaprogramming techniques supported in C++11. The purpose of this is to measure the difference in run time between the template metaprogramming approach and the old conventional approach.
// Template Metaprogramming Approach
template<int N>
constexpr int fibonacci() {return fibonacci<N-1>() + fibonacci<N-2>(); }
template<>
constexpr int fibonacci<1>() { return 1; }
template<>
constexpr int fibonacci<0>() { return 0; }
// Conventional Approach
int fibonacci(int N) {
    if ( N == 0 ) return 0;
    else if ( N == 1 ) return 1;
    else
        return fibonacci(N-1) + fibonacci(N-2);
}
I ran both programs for N = 40 on my GNU/Linux system, measured the time, and found that the conventional solution (1.15 seconds) is around two times slower than the template-based solution (0.55 seconds). This is a significant improvement, since both approaches are based on recursion.
To understand it better, I compiled the program with the -fdump-tree-all flag in g++ and found that the compiler actually generated 40 different functions (fibonacci<40>, fibonacci<39>, ..., fibonacci<0>).
constexpr int fibonacci() [with int N = 40] () {
int D.29948, D.29949, D.29950;
D.29949 = fibonacci<39> ();
D.29950 = fibonacci<38> ();
D.29948 = D.29949 + D.29950;
return D.29948;
}
constexpr int fibonacci() [with int N = 39] () {
int D.29952, D.29953, D.29954;
D.29953 = fibonacci<38> ();
D.29954 = fibonacci<37> ();
D.29952 = D.29953 + D.29954;
return D.29952;
}
...
...
...
constexpr int fibonacci() [with int N = 0] () {
int D.29962;
D.29962 = 0;
return D.29962;
}
I also debugged the program in GDB and found that all of the above functions are executed the same number of times as in the conventional recursive approach. If both versions of the program execute their functions the same number of times (recursively), then how is this speedup achieved by template metaprogramming techniques? I would also like your opinion on how and why the template metaprogramming based approach takes half the time of the other version. Can this program be made faster than the current one?
Basically my intention here is to understand what's going on internally as much as possible.
My machine is GNU/Linux with GCC 4.8.1, and I used the -O3 optimization for both programs.
Try this:
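A minimal sketch of a class-template solution in this spirit, where every value is a static compile-time constant so each instantiation is computed only once (an illustrative reconstruction; the names fib and value are placeholders, not necessarily the answer's exact listing):

#include <iostream>

// Sketch: class-template Fibonacci. Each instantiation fib<K> is created
// exactly once, and its result is a compile-time constant, so the compiler
// effectively memoizes all intermediate values.
template<int N>
struct fib {
    static const int value = fib<N-1>::value + fib<N-2>::value;
};
template<> struct fib<1> { static const int value = 1; };
template<> struct fib<0> { static const int value = 0; };

int main() {
    std::cout << fib<40>::value << '\n';   // prints 102334155
}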
With clang and -Os, this compiles in roughly 0.5s and runs in zero time for N=40. Your "conventional" approach compiles in roughly 0.4s and runs in 0.8s. Just for checking, the result is 102334155, right?

When I tried your own constexpr solution, the compiler ran for a couple of minutes and then I stopped it because apparently memory was full (the computer started freezing). The compiler was trying to compute the final result, and your implementation is extremely inefficient for use at compile time.

With this solution, the template instantiations at N-2 and N-1 are re-used when instantiating N. So fibonacci<40> is actually known at compile time as a value, and there is nothing to do at run time. This is a dynamic programming approach, and of course you can do the same at run time if you store all the values from 0 through N-1 before computing the value at N.
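A minimal run-time sketch of that idea (hypothetical code; fibonacci_dp is a placeholder name, and int will overflow for larger N):

#include <vector>

// Sketch: bottom-up dynamic programming Fibonacci at run time.
// Fills a table with every value from 0 through N before returning fib(N).
int fibonacci_dp(int N) {
    if (N < 2) return N;
    std::vector<int> table(N + 1);
    table[0] = 0;
    table[1] = 1;
    for (int i = 2; i <= N; ++i)
        table[i] = table[i - 1] + table[i - 2];
    return table[N];
}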
With your solution, the compiler can evaluate fibonacci<N>() at compile time but is not required to. In your case, all or part of the computation is left for run time. In my case, all of the computation is attempted at compile time, hence it never ends.

Adding -O1 (or higher) to GCC 4.8.1 will make fibonacci<40>() a compile-time constant and all the template-generated code will disappear from your assembly: a program that simply returns fibonacci<40>() produces assembly that just loads the final constant (see the sketch below). This gives the best runtime performance.
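For instance, a hypothetical main using the constexpr fibonacci from the question (the exact assembly depends on the compiler and flags):

int main() {
    // With -O1 or higher, GCC 4.8.1 evaluates this call at compile time,
    // so the generated assembly just loads the constant 102334155.
    int fib40 = fibonacci<40>();
    return fib40;
}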
However, it looks like you are building without optimizations (-O0), so you get something quite a bit different. The assembly output for each of the 40 fibonacci functions looks basically identical (except for the 0 and 1 cases): each one simply sets up the stack, calls the two other fibonacci functions, adds the values, tears down the stack, and returns. No branching, and no comparisons.
Now compare that with the assembly from the conventional approach. Each time the function is called, it needs to check whether N is 0 or 1 and act accordingly. That comparison is not needed in the template version because it is built into each instantiation via the magic of templates. My guess is that the un-optimized version of the template code is faster because it avoids those comparisons and also avoids any branch mispredictions.
Maybe just use a more efficient algorithm?
My code is based on an idea described by D. Knuth in the first part of his "The Art of Computer Programming". I can't remember the exact place in this book, but I'm sure that the algorithm was described there.
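One O(log N) method along these lines is fast doubling, based on the identities F(2k) = F(k)·(2·F(k+1) − F(k)) and F(2k+1) = F(k)² + F(k+1)²; a sketch, assuming this is the idea meant (64-bit values, recursive halving):

#include <cstdint>
#include <utility>

// Sketch: fast-doubling Fibonacci, O(log N) instead of exponential.
// Returns the pair {F(n), F(n+1)}.
std::pair<std::uint64_t, std::uint64_t> fib_pair(unsigned n) {
    if (n == 0) return {0, 1};
    auto p = fib_pair(n / 2);                                    // {F(k), F(k+1)}, k = n/2
    std::uint64_t c = p.first * (2 * p.second - p.first);        // F(2k)
    std::uint64_t d = p.first * p.first + p.second * p.second;   // F(2k+1)
    return (n % 2 == 0) ? std::make_pair(c, d) : std::make_pair(d, c + d);
}

std::uint64_t fib_fast(unsigned n) { return fib_pair(n).first; }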
The reason is that your runtime solution is not optimal: for every fib number, the function is called several times. The Fibonacci sequence has overlapping subproblems, so for example fib(6) calls fib(4), and fib(5) also calls fib(4).

The template-based approach (inadvertently) uses a dynamic programming approach, meaning that it stores the values of previously calculated numbers, avoiding repetition. So when fib(5) calls fib(4), the number was already calculated when fib(6) did.

I recommend looking up "dynamic programming fibonacci" and trying that; it should speed things up dramatically.
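A minimal sketch of that, using a top-down memoized recursion (fib_memo and fib are placeholder names; int overflows for larger N):

#include <vector>

// Sketch: top-down dynamic programming. Each fib(k) is computed once and
// cached, so the overlapping subproblems are not recomputed.
int fib_memo(int N, std::vector<int>& cache) {
    if (N < 2) return N;
    if (cache[N] != -1) return cache[N];
    return cache[N] = fib_memo(N - 1, cache) + fib_memo(N - 2, cache);
}

int fib(int N) {
    std::vector<int> cache(N + 1, -1);
    return fib_memo(N, cache);
}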