This question may sound fairly elementary, but this is a debate I had with another developer I work with.
I was taking care to stack allocate things where I could, instead of heap allocating them. He was talking to me and watching over my shoulder and commented that it wasn't necessary because they are the same performance wise.
I was always under the impression that growing the stack was constant time, and heap allocation's performance depended on the current complexity of the heap for both allocation (finding a hole of the proper size) and de-allocating (collapsing holes to reduce fragmentation, as many standard library implementations take time to do this during deletes if I am not mistaken).
This strikes me as something that would probably be very compiler dependent. For this project in particular I am using a Metrowerks compiler for the PPC architecture. Insight on this combination would be most helpful, but in general, for GCC, and MSVC++, what is the case? Is heap allocation not as high performing as stack allocation? Is there no difference? Or are the differences so minute it becomes pointless micro-optimization.
Concerns Specific to the C++ Language
First of all, there is no so-called "stack" or "heap" allocation mandated by C++. If you are talking about automatic objects in the block scope, they are even not "allocated". (BTW, automatic storage duration in C is definitely NOT the same to "allocated"; the latter is "dynamic" in the C++ parlance.) And the dynamically allocated memory is on the free store, not necessarily on "the heap", though the latter is often the (default) implementation.
Although as per the abstract semantic rules, automatic objects still occupy memory, a conforming C++ implementation is allowed to ignore this fact when it can prove this does not matter (when it does not change the observable behavior of the program). This permission is granted by the as-if rule in ISO C++, which is also the general clause enabling the usual optimizations (and there is also an almost same rule in ISO C). Besides the as-if rule, ISO C++ also has copy elision rules to allow omission of specific creations of objects. The constructor and destructor calls involved are thereby omitted. As a result, the automatic objects (if any) in these constructors and destructors are also eliminated, compared to naive abstract semantics implied by the source code.
On the other hand, free store allocation is definitely "allocation" by design. Under ISO C++ rules, such an allocation can be achieved by a call of an allocation function. However, since ISO C++14, there is a new (non-as-if) rule to allow merging global allocation function (i.e.
::operator new
) calls in specific cases. So parts of dynamic allocation operations can also be no-op like the case of automatic objects.Allocation functions allocate resources of memory. Objects can be further allocated based on allocation using allocators. For automatic objects, they are directly presented - although the underlying memory can be accessed and be used to provide memory to other objects (by placement
new
), but this does not make great sense as the free store, because there is no way to move the resources elsewhere.All other concerns are out of the scope of C++. Nevertheless, they can be still significant.
About Implementations of C++
C++ does not expose reified activation records or some sorts of first-class continuations (e.g. by the famous
call/cc
), there is no way to directly manipulate the activation record frames - where the implementation need to place the automatic objects to. Once there is no (non-portable) interoperations with the underlying implementation ("native" non-portable code, such as inline assembly code), an omission of the underlying allocation of the frames can be quite trivial. For example, when the called function is inlined, the frames can be effectively merged into others, so there is no way to show what is the "allocation".However, once interops are respected, things are getting complex. A typical implementation of C++ will expose the ability of interop on ISA (instruction-set architecture) with some calling conventions as the binary boundary shared with the native (ISA-level machine) code. This would be explicitly costly, notably, when maintaining the stack pointer, which is often directly held by an ISA-level register (with probably specific machine instructions to access). The stack pointer indicates the boundary of the top frame of the (currently active) function call. When a function call is entered, a new frame is needed and the stack pointer is added or subtracted (depending on the convention of ISA) by a value not less than the required frame size. The frame is then said allocated when the stack pointer after the operations. Parameters of functions may be passed onto the stack frame as well, depending on the calling convention used for the call. The frame can hold the memory of automatic objects (probably including the parameters) specified by the C++ source code. In the sense of such implementations, these objects are "allocated". When the control exits the function call, the frame is no longer needed, it is usually released by restoring the stack pointer back to the state before the call (saved previously according to the calling convention). This can be viewed as "deallocation". These operations makes the activation record effectively a LIFO data structure, so it is often called "the (call) stack". The stack pointer effectively indicates the top position of the stack.
Because most C++ implementations (particularly the ones targeting ISA-level native code and using the assembly language as its immediate output) uses similar strategies like this, such confusing "allocation" scheme is popular. Such allocations (as well as deallocations) do spend machine cycles, and it can be expensive when the (non-optimized) calls occur frequently, even though modern CPU microarchitectures can have complex optimizations implemented by hardware for the common code pattern (like using a stack engine in implementing
PUSH
/POP
instructions).But anyway, in general, it is true that the cost of stack frame allocation is significantly less than a call to an allocation function operating the free store (unless it is totally optimized away), which itself can have hundreds of (if not millions of :-) operations to maintain the stack pointer and other states. Allocation functions are typically based on API provided by the hosted environment (e.g. runtime provided by the OS). Different to the purpose of holding automatic objects for functions calls, such allocations are general-purposed, so they will not have frame structure like a stack. Traditionally, they allocate space from the pool storage called heap (or several heaps). Different to the "stack", the concept "heap" here does not indicate the data structure being used; it is derived from early language implementations decades ago. (BTW, the call stack is usually allocated with fixed or user-specified size from the heap by the environment in program or thread startup.) The nature of use cases makes allocations and deallocations from a heap far more complicated (than push or pop of stack frames), and hardly possible to be directly optimized by hardware.
Effects on Memory Access
The usual stack allocation always put the new frame on the top, so it has a quite good locality. This is friendly to cache. OTOH, memory allocated randomly in the free store has no such property. Since ISO C++17, there are pool resource templates provided by
<memory>
. The direct purpose of such interface is to allow the results of consecutive allocations being close together in memory. This acknowledges the fact that this strategy is generally good for performance with contemporary implementations, e.g. being friendly to cache in modern architectures. This is about the performance of access rather than allocation, though.Concurrency
Expectation of concurrent access of memory can have different effects between the stack and heaps. A call stack is usually exclusively owned by one thread of execution in a C++ implementation. OTOH, heaps are often shared among the threads in a process. For such heaps, the allocation and deallocation functions have to protect the shared internal administrative data structure from data race. As a result, heap allocations and deallocations may have additional overhead due to internal synchronization operations.
Space Efficiency
Due to the nature of the use cases and internal data structures, heaps may suffers from internal memory fragmentation, while the stack does not. This does not have direct impacts on performance of memory allocation, but in a system with virtual memory, low space efficiency may degenerate overall performance of memory access. This is particularly awful when HDD is used as swap of physical memory. It can cause quite long latency - sometimes billions of cycles.
Limitations of Stack Allocations
Although stack allocations are often superior in performance than heap allocations in reality, it certainly does not mean stack allocations can always replace heap allocations.
First, there is no way to allocate space on the stack with a size specified at runtime in a portable way with ISO C++. There are extensions provided by implementations like
alloca
and G++'s VLA (variable-length array), but there are reasons to avoid use them. (IIRC, Linux source removes use of VLA recently.) (Also note ISO C99 does have VLA, but ISO C11 turns the support optional.)Second, there is no reliable and portable way to detect stack space exhaustion. This is often called stack overflow (hmm, etymology of this site), but probably more accruately, "stack overrun". In reality this often causes invalid access of memory and the state of the program is then corruptied (...or maybe worse, a security hole). In fact, ISO C++ has no concept of stack and makes it undefined behavior when the resource is exhausted. Be cautious about how much room should be left for automatic objects.
If the stack space run out, there are too many object allocated in the stack, which can be caused by too many active calls of functions or improper use of automatic objects. Such cases may suggest existence of bugs, e.g. a recursive function call without correct exit conditions.
Nevertheless, deep recursive calls are sometimes desired. In implementations of languages requiring support of unbound active calls (call depth only limited by total memory), it is impossible to use native call stack directly as the target language activation record like typical C++ implementations. For example, SML/NJ explicitly allocates frames on the heap and uses cactus stacks. The complicated allocation of such activation record frames is usually not fast as the call stack frames. However, when further implementing languages with proper tail recursion, direct stack allocation in object language (that is, the "object" in the language does not stored as references, but primitive values which can be one-to-one mapped to unshared C++ objects) is even more complicated with more performance penalty in general. When using C++ to implement such languages, it is difficult to estimate the performance impacts.
Aside from the orders-of-magnitude performance advantage over heap allocation, stack allocation is preferable for long running server applications. Even the best managed heaps eventually get so fragmented that application performance degrades.
It's not jsut stack allocation that's faster. You also win a lot on using stack variables. They have better locality of reference. And finally, deallocation is a lot cheaper too.
It would be like this in asm. When you're in
func
, thef1
and pointerf2
has been allocated on stack (automated storage). And by the way, Foof1(a1)
has no instruction effects on stack pointer (esp
),It has been allocated, iffunc
wants get the memberf1
, it's instruction is something like this:lea ecx [ebp+f1], call Foo::SomeFunc()
. Another thing the stack allocate may make someone think the memory is something likeFIFO
, theFIFO
just happened when you go into some function, if you are in the function and allocate something likeint i = 0
, there no push happened.In general, stack allocation is faster than heap allocation as mentioned by almost every answer above. A stack push or pop is O(1), whereas allocating or freeing from a heap could require a walk of previous allocations. However you shouldn't usually be allocating in tight, performance-intensive loops, so the choice will usually come down to other factors.
It might be good to make this distinction: you can use a "stack allocator" on the heap. Strictly speaking, I take stack allocation to mean the actual method of allocation rather than the location of the allocation. If you're allocating a lot of stuff on the actual program stack, that could be bad for a variety of reasons. On the other hand, using a stack method to allocate on the heap when possible is the best choice you can make for an allocation method.
Since you mentioned Metrowerks and PPC, I'm guessing you mean Wii. In this case memory is at a premium, and using a stack allocation method wherever possible guarantees that you don't waste memory on fragments. Of course, doing this requires a lot more care than "normal" heap allocation methods. It's wise to evaluate the tradeoffs for each situation.