Handling stack overflows in embedded systems

2019-03-13 15:09发布

问题:

In embedded software, how do you handle a stack overflow in a generic way? I come across some processor which does protect in hardware way like recent AMD processors. There are some techniques on Wikipedia, but are those real practical approaches?

Can anybody give a clear suggested approach which works in all case on today's 32-bit embedded processors?

回答1:

Ideally you write your code with static stack usage (no recursive calls). Then you can evaluate maximum stack usage by:

  1. static analysis (using tools)
  2. measurement of stack usage while running your code with complete code coverage (or as high as possible code coverage until you have a reasonable confidence you've established the extent of stack usage, as long as your rarely-run code doesn't use particularly more stack than the normal execution paths)

But even with that, you still want to have a means of detecting and then handling stack overflow if it occurs, if at all possible, for more robustness. This can be especially helpful during the project's development phase. Some methods to detect overflow:

  1. If the processor supports a memory read/write interrupt (i.e. memory access breakpoint interrupt) then it can be configured to point to the furthest extent of the stack area.
  2. In the memory map configuration, set up a small (or large) block of RAM that is a "stack guard" area. Fill it with known values. In the embedded software, regularly (as often as reasonably possible) check the contents of this area. If it ever changes, assume a stack overflow.

Once you've detected it, then you need to handle it. I don't know of many ways that code can gracefully recover from a stack overflow, because once it's happened, your program logic is almost certainly invalidated. So all you can do is

  1. log the error
    1. Logging the error is very useful, because otherwise the symptoms (unexpected reboots) can be very hard to diagnose.
    2. Caveat: The logging routine must be able to run reliably even in a corrupted-stack scenario. The routine should be simple. I.e. with a corrupted stack, you probably can't try to write to EEPROM using your fancy EEPROM writing background task. Maybe just log the error into a struct that is reserved for this purpose, in non-init RAM, which can then be checked after reboot.
  2. Reboot (or perhaps shutdown, especially if the error reoccurs repeatedly)
    1. Possible alternative: restart just the particular task, if you're using an RTOS, and your system is designed so the stack corruption is isolated, and all the other tasks are able to handle that task restarting. This would take some serious design consideration.


回答2:

While embedded stack overflow can be caused by recursive functions getting out of hand, it can also be caused by errant pointer usage (although this could be considered another type of error), and normal system operation with an undersized stack. In other words, if you don't profile your stack usage it can occur outside of a defect or bug situation.

Before you can "handle" stack overflow you have to identify it. A good method for doing this is to load the stack with a pattern during initialization and then monitor how much of the pattern disappears during run-time. In this fashion you can identify the highest point the stack has reached.

The pattern check algorithm should execute in the opposite direction of stack growth. So, if the stack grows from 0x1000 to 0x2000, then your pattern check can start at 0x2000 to increase efficiency. If your pattern was 0xAA and the value at 0x2000 contains something other than 0xAA, you know you've probably got some overflow.

You should also consider placing an empty RAM buffer immediately after the stack so that if you do detect overflow you can shut down the system without losing data. If your stack is followed immediately by heap or SRAM data then identifying an overflow will mean that you have already suffered corruption. Your buffer will protect you for a little bit longer. On a 32-bit micro you should have enough RAM to provide at least a small buffer.



回答3:

A stack overflow occurs the stack memory is exhausted by too large of a call stack ? e.g. a recursive function too many levels deep.

There are techniques to detect a stack overflow by placing known data after the stack so it could be detected if the stack grew too much and overwrote it.

There are static source code analysis tools such as GnatStack, StackAnalyzer from AbsInt, and Bound-T which can be used to determine or make a guess at the maximum run-time stack-size.



回答4:

If you are using a processor with a Memory Management Unit your hardware can do this for you with minimal software overhead. Most modern 32 bit processors have them and more and more 32 bit micro controllers feature them as well.

Set up a memory area in the MMU that will be used for the stack. It should be bordered by two memory areas where the MMU does not allow access. When your application is running you will receive a exception/interrupt as soon as you overflow the stack.

Because you get a exception at the moment the error occur you know exactly where in your application the stack went bad. You can look at the call stack to see exactly how you got to where you are. This makes it a lot easier to find your problem than trying to figure out what is wrong by detecting your problem long after it happened.

I have used this successfully on PPC and AVR32 processors. When you start out using an MMU you feel like it is a waste of time since you got along great without it for many years but once you see the advantages of a exception at the exact spot where your memory problem occur you will never go back. A MMU can also detect zero pointer accesses if you disallow memory access to the bottom park of your ram.

If you are using an RTOS your MMU protects the memory and stacks of other tasks errors in one task should not affect them. This means you could also easily restart your task without affecting the other tasks.

In addition to this a processor with a MMU usually also has lots of ram your program is a lot less likely to overflow your stack and you don't need to fine tune everything to get you application to run correctly with a small memory foot print.

An alternative to this would be to use the Processor debug facilities to cause a interrupt on a memory access to the end of your stack. This will probably be very processor specific.