So I have been having a look at some of the magic that is O3
in GCC (well actually I'm compiling using Clang but it's the same with GCC and I'm guessing a large part of the optimiser was pulled over from GCC to Clang).
Consider this C program:
int foo(int n) {
if (n == 0) return 1;
return n * foo(n-1);
}
int main() {
return foo(10);
}
The first thing I was pretty WOW-ed at (which was also WOW-ed at in this question - https://stackoverflow.com/a/414774/1068248) was how int foo(int)
(a basic factorial function) compiles into a tight loop. This is the ARM assembly for it:
.globl _foo
.align 2
.code 16
.thumb_func _foo
_foo:
mov r1, r0
movs r0, #1
cbz r1, LBB0_2
LBB0_1:
muls r0, r1, r0
subs r1, #1
bne LBB0_1
LBB0_2:
bx lr
Blimey I thought. That's pretty interesting! Completely tight looping to do the factorial. WOW. It's not a tail call optimisation since, well, it's not a tail call. But it appears to have done a much similar optimisation.
Now look at main
:
.globl _main
.align 2
.code 16
.thumb_func _main
_main:
movw r0, #24320
movt r0, #55
bx lr
That just blew my mind to be honest. It's just totally bypassing foo
and returning 3628800
which is 10!
.
This makes me really realise how your compiler can often do a much better job than you can at optimising your code. But it raises the question, how does it manage to do such a good job? So, can anyone explain (possibly by linking to relevant code) how the following optimisations work:
The initial
foo
optimisation to be a tight loop.The optimisation where
main
just goes and returns the result directly rather than actually executingfoo
.
Also another interesting side effect of this question would be to show some more interesting optimisations which GCC/Clang can do.