Reducing stack usage during recursion with GCC + A

2019-06-20 14:17发布

问题:

I have a recursive descent parser for an embedded ARM processor (in C + GCC, for ARM Cortex M3).

While running it I've noticed that it uses a massive amount of stack space (even more than you might expect) and under closer inspection I have found that this is happening:

extern int bar(int *p);

int foo() {
 int z = foo(); // it's an example!

 int n[100];  // stack usage
 return z+bar(n); // calling bar(n) stops n from being optimised out
}

Result of running arm-none-eabi-gcc -fomit-frame-pointer -S test.c

foo:
    str lr, [sp, #-4]!  ; Push link register
    sub sp, sp, #412    ; Reserve space on stack, even if we don't need it now!
    bl  foo             ; Recurse
    str r0, [sp, #404]  ; Store result
    ...

So at the start of the function, it pushes the entire stack frame onto the stack. However after a few iterations it's got loads of stuff on the stack that it hasn't used yet.

Ideally, what I'd like is for GCC to generate:

foo:
    str lr, [sp, #-4]!  ; Push link register
    ; Don't reserve space, because we don't need it
    bl  foo             ; Recurse
    sub sp, sp, #412    ; Reserve space now
    str r0, [sp, #404]  ; Store result
    ...

(This is probably not correct but I hope you get the idea)

Something a bit like this can be achieved with the following code, but it's really nasty (and if GCC inlines fooworker, it breaks again!). There must be a better way?

int fooworker(int z) {
 int n[100];  // stack usage
 return z+bar(n); // calling bar(n) stops n from being optimised out
}


int foo() {
 return fooworker(foo());
}

So is there a way of telling GCC to only enlarge the stack at the start of the basic block, or is there a 'barrier' statement that causes extra push/pop ops to be added at that point? I guess GCC is using one of the ARM standard call types - but is there a way to tag these functions with another call type that is a bit more efficient with the stack, or is there a way to rewrite the functions such that the stack is used a bit more sensibly?

Please don't tell me not to use recursion, it is not answering the question.

回答1:

int *n = alloca(sizeof(*n) * 100);

It's ugly and I'd personally split up the function into two parts, but seems to work in my gcc on amd64 on all optimization levels.



回答2:

This is all susceptible to optimization, but you could also try to introduce a new scope:

extern int bar(int *p);

int foo() {
  int z = foo();

  {
    int n[100];
    return z+bar(n);
  }
}

The introduction of the new scope means that n should not live before foo() is called. Again, optimization could break all of this, like your own solution or the accepted one.