I am reading the IDA Pro Book. On page 86 while discussing calling conventions, the author shows an example of cdecl calling convention that eliminates the need for the caller to clean arguments off the stack. I am reproducing the code snippet below:
; demo_cdecl(1, 2, 3, 4); //programmer calls demo_cdecl
mov [esp+12], 4 ; move parameter z to fourth position on stack
mov [esp+8], 3 ; move parameter y to third position on stack
mov [esp+4], 2 ; move parameter x to second position on stack
mov [esp], 1 ; move parameter w to top of stack
call demo_cdecl ; call the function
The author goes on to say that
in the above example, the compiler has preallocated storage space for the arguments to demo_cdecl at the top of the stack during the function prologue.
I am going to assume that there is a sub esp, 0x10 at the top of the code snippet. Otherwise, you would just be corrupting the stack.
He later says that the caller doesn't need to adjust the stack when the call to demo_cdecl completes. But surely, there has to be an add esp, 0x10 after the call.
What exactly am I missing?
Compilers often choose mov to store args instead of push, if there's enough space already allocated (e.g. with a sub esp, 0x10 earlier in the function like you suggested).
Here's an example:
compiled by clang 6.0 -O3 -march=haswell on Godbolt
clang's code-gen would have been even better with sub esp,8 / push 2, with the rest of the function unchanged, i.e. let push grow the stack, because it has smaller code-size than mov, especially mov-immediate, and performance is not worse (because we're about to call, which also uses the stack engine). See What C/C++ compiler can use push pop instructions for creating local variables, instead of just increasing esp once? for more details.
I also included in the Godbolt link GCC output with/without -maccumulate-outgoing-args, which defers clearing the stack until the end of the function.
By default (without accumulate outgoing args) gcc does let ESP bounce around, and even uses 2x pop to clear 2 args from the stack. (Avoiding a stack-sync uop, at the cost of 2 useless loads that hit in L1d cache.) With 3 or more args to clear, gcc uses add esp, 4*N. I suspect that reusing the arg-passing space with mov stores instead of add esp / push would be a win sometimes for overall performance, especially with registers instead of immediates. (push imm8 is much more compact than mov imm32.)
With -maccumulate-outgoing-args, the output is basically like clang, but gcc still saves/restores ebx and keeps a in it, before doing a tailcall.
Note that having ESP bounce around requires extra metadata in .eh_frame for stack unwinding. Jan Hubicka writes in 2014:
So a 4% code-size saving (in bytes; matters for L1i cache footprint) from using push for args and at least typically clearing them off the stack after each call. I think there's a happy medium here where gcc could use more push without using only push/pop.
There's a confounding effect of maintaining 16-byte stack alignment before call, which is required by the current version of the i386 System V ABI. In 32-bit mode, it used to just be a gcc default to maintain -mpreferred-stack-boundary=4 (i.e. 1<<4 = 16 bytes). I think you can still use -mpreferred-stack-boundary=2 to violate the ABI and make code that only cares about 4B alignment for ESP.
I didn't try this on Godbolt, but you could.
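For anyone who wants to try these options, a small test case along the following lines could be pasted into Godbolt and built for a 32-bit target (e.g. with -m32 -O3). This is only a sketch: demo_cdecl is the name from the book's snippet, but its body, the caller function, and the noinline attribute (GCC/clang syntax, used so the calls are not inlined away) are assumptions made for illustration. The assembly in the comments is a rough picture of the two strategies, not actual compiler output.

```c
/* Name taken from the book's snippet; the body is an assumption.
   noinline (a GCC/clang extension) keeps both call sites visible at -O3. */
__attribute__((noinline))
int demo_cdecl(int w, int x, int y, int z) {
    return w + 2 * x + 3 * y + 4 * z;
}

/* Hypothetical caller with two call sites, so that reuse of the
   outgoing-argument area (or repeated cleanup) shows up in the asm. */
int caller(void) {
    /*
     * Strategy A, arguments stored into preallocated space (roughly):
     *     sub  esp, N                  ; prologue reserves the arg area once
     *     mov  dword ptr [esp+12], 4
     *     mov  dword ptr [esp+8], 3
     *     mov  dword ptr [esp+4], 2
     *     mov  dword ptr [esp], 1
     *     call demo_cdecl              ; no add esp here
     *     ...                          ; second call reuses the same slots
     *     add  esp, N                  ; epilogue releases everything
     *
     * Strategy B, push at each call site (roughly):
     *     push 4
     *     push 3
     *     push 2
     *     push 1
     *     call demo_cdecl
     *     add  esp, 16                 ; caller cleans up after each call
     */
    int a = demo_cdecl(1, 2, 3, 4);
    int b = demo_cdecl(5, 6, 7, 8);
    return a + b;
}
```

Comparing the output with and without -maccumulate-outgoing-args (GCC), or with -mpreferred-stack-boundary=2, should show which of the two shapes the compiler picks for caller.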
The parameters are stored at addresses that are positive offsets from the stack pointer. Remember that the stack grows downwards. This means that the space required to hold these parameters has already been allocated (probably by the caller's prologue code). That's why there is no need for sub esp, N for each call sequence.
In the cdecl calling convention, the caller always has to clean up the stack one way or another. If allocation was done by the caller's prologue, it will be deallocated by the epilogue (together with the caller's local variables). Otherwise, if the parameters of the callee were allocated somewhere in the middle of the caller's code, then the easiest way to clean up is by using add esp, N right after the call instruction.
There is a trade-off involved between these two different implementations of the cdecl calling convention. Allocating parameters in the prologue means that the largest space required by any callee must be allocated. It will be reused for each callee. Then at the end of the caller, it will be cleaned up once. So this may unnecessarily waste stack space, but it may improve performance. In the other technique, the caller only allocates space for parameters when the associated call site is actually going to be reached. Cleanup is then performed right after the callee returns. So no stack space is wasted. But allocation and cleanup have to be performed at each call site in the caller. You can also imagine an implementation that is in between these two extremes.
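The trade-off can be made concrete with a toy cost model. The sketch below is not how any compiler decides; CallSite, StackCost, and both functions are hypothetical names introduced here, and the model assumes each allocation costs one sub esp and one add esp. Real compilers usually fold the argument area into the same sub esp that reserves locals, so the prologue strategy is often cheaper than this count suggests.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one call site inside the caller. */
typedef struct {
    int arg_bytes;  /* bytes of arguments passed at this call site */
    int reached;    /* nonzero if this run actually executes the call */
} CallSite;

typedef struct {
    int reserved_bytes;   /* largest argument area held at any time */
    int esp_adjustments;  /* explicit sub/add esp instructions executed */
} StackCost;

/* Prologue allocation: reserve the largest argument area of ANY call site,
   reached or not, then reuse it with mov stores and release it once. */
StackCost cost_prologue_alloc(const CallSite *sites, size_t n) {
    StackCost c = {0, 0};
    for (size_t i = 0; i < n; i++) {
        assert(sites[i].arg_bytes >= 0);
        if (sites[i].arg_bytes > c.reserved_bytes)
            c.reserved_bytes = sites[i].arg_bytes;
    }
    if (c.reserved_bytes > 0)
        c.esp_adjustments = 2;  /* one sub in the prologue, one add in the epilogue */
    return c;
}

/* Per-call allocation: only call sites that are reached pay, but each one
   pays for its own allocation and its own cleanup. */
StackCost cost_per_call_alloc(const CallSite *sites, size_t n) {
    StackCost c = {0, 0};
    for (size_t i = 0; i < n; i++) {
        assert(sites[i].arg_bytes >= 0);
        if (!sites[i].reached || sites[i].arg_bytes == 0)
            continue;
        if (sites[i].arg_bytes > c.reserved_bytes)
            c.reserved_bytes = sites[i].arg_bytes;
        c.esp_adjustments += 2;  /* sub before the stores, add after the call */
    }
    return c;
}
```

For three call sites passing 16, 64, and 8 bytes, where the 64-byte call is never reached, the prologue strategy reserves 64 bytes with 2 adjustments, while the per-call strategy never holds more than 16 bytes but executes 4 adjustments.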