I have a recursive descent parser for an embedded ARM processor (in C + GCC, for ARM Cortex M3).
While running it I've noticed that it uses a massive amount of stack space (even more than you might expect) and under closer inspection I have found that this is happening:
extern int bar(int *p);
int foo() {
int z = foo(); // it's an example!
int n[100]; // stack usage
return z+bar(n); // calling bar(n) stops n from being optimised out
}
Result of running arm-none-eabi-gcc -fomit-frame-pointer -S test.c
foo:
str lr, [sp, #-4]! ; Push link register
sub sp, sp, #412 ; Reserve space on stack, even if we don't need it now!
bl foo ; Recurse
str r0, [sp, #404] ; Store result
...
So at the start of the function, it pushes the entire stack frame onto the stack. However after a few iterations it's got loads of stuff on the stack that it hasn't used yet.
Ideally, what I'd like is for GCC to generate:
foo:
str lr, [sp, #-4]! ; Push link register
; Don't reserve space, because we don't need it
bl foo ; Recurse
sub sp, sp, #412 ; Reserve space now
str r0, [sp, #404] ; Store result
...
(This is probably not correct but I hope you get the idea)
Something a bit like this can be achieved with the following code, but it's really nasty (and if GCC inlines fooworker, it breaks again!). There must be a better way?
int fooworker(int z) {
int n[100]; // stack usage
return z+bar(n); // calling bar(n) stops n from being optimised out
}
int foo() {
return fooworker(foo());
}
So is there a way of telling GCC to only enlarge the stack at the start of the basic block, or is there a 'barrier' statement that causes extra push/pop ops to be added at that point? I guess GCC is using one of the ARM standard call types - but is there a way to tag these functions with another call type that is a bit more efficient with the stack, or is there a way to rewrite the functions such that the stack is used a bit more sensibly?
Please don't tell me not to use recursion, it is not answering the question.
This is all susceptible to optimization, but you could also try to introduce a new scope:
The introduction of the new scope means that
n
should not live beforefoo()
is called. Again, optimization could break all of this, like your own solution or the accepted one.It's ugly and I'd personally split up the function into two parts, but seems to work in my gcc on amd64 on all optimization levels.