Why is it better to use the ebp than the esp regis

Posted 2019-07-26 10:15

Question:

I am new to MASM and am confused about these pointer registers. I would really appreciate it if you could help me.

Thanks

Answer 1:

Encoding an addressing mode using [ebp + disp8] is one byte shorter than [esp + disp8], because using ESP as a base register requires a SIB byte. See "rbp not allowed as SIB base?" for details. (That question's title is asking about the fact that [ebp] has to be encoded as [ebp+0].)
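To make the size difference concrete, here is a minimal sketch of an encoder for `mov r32, [base + disp8]` in 32-bit mode. The function name `encode_mov_load_disp8` and the enum are made up for illustration; only the byte layout (opcode 8B, ModRM with mod=01, optional SIB) reflects the x86 instruction format.

```c
#include <stddef.h>
#include <stdint.h>

/* 32-bit register numbers as they appear in ModRM/SIB fields. */
enum { EAX = 0, ECX = 1, EDX = 2, EBX = 3, ESP = 4, EBP = 5, ESI = 6, EDI = 7 };

/* Hypothetical helper: encode `mov dst, [base + disp8]` into out[].
 * Returns the number of bytes written (3, or 4 when base is ESP). */
size_t encode_mov_load_disp8(uint8_t out[4], int dst, int base, int8_t disp)
{
    size_t n = 0;
    out[n++] = 0x8B;                                /* opcode: mov r32, r/m32 */
    out[n++] = (uint8_t)(0x40 | (dst << 3) | base); /* ModRM: mod=01 (disp8), reg=dst, rm=base */
    if (base == ESP) {
        /* rm=100 does not mean "ESP"; it means "a SIB byte follows".
         * SIB with scale=00, index=100 (no index), base=100 (ESP) is 0x24. */
        out[n++] = 0x24;
    }
    out[n++] = (uint8_t)disp;                       /* 8-bit displacement */
    return n;
}
```

With this sketch, `mov eax, [ebp+8]` comes out as 8B 45 08 (3 bytes) and `mov eax, [esp+8]` as 8B 44 24 08 (4 bytes).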

The first use of [esp + disp8] after a push or pop, or after a call, requires a stack-sync uop on Intel CPUs. (See "What is the stack engine in the Sandybridge microarchitecture?") Of course, mov ebp, esp to make a stack frame in the first place also triggers a stack-sync uop: any explicit reference to ESP in the out-of-order core (not just in addressing modes) causes a stack-sync uop if the stack engine might have an offset that the out-of-order back end doesn't know about.
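The rule described above can be turned into a toy counter. This is a deliberately simplified model, not a description of real hardware: it assumes every push/pop/call/ret leaves a pending offset in the stack engine, that any explicit ESP reference with a pending offset costs exactly one sync uop, and it ignores other sync causes. The names `op_kind` and `count_stack_syncs` are hypothetical.

```c
#include <stddef.h>

typedef enum {
    OP_STACK_IMPLICIT, /* push, pop, call, ret: ESP is updated by the stack engine */
    OP_ESP_EXPLICIT,   /* any instruction that names ESP, e.g. mov ebp, esp or [esp+4] */
    OP_OTHER           /* does not touch ESP at all */
} op_kind;

/* Count stack-sync uops for an instruction sequence under the toy model. */
size_t count_stack_syncs(const op_kind *ops, size_t n)
{
    size_t syncs = 0;
    int pending = 0; /* nonzero: stack engine may hold an offset the back end lacks */
    for (size_t i = 0; i < n; i++) {
        if (ops[i] == OP_STACK_IMPLICIT) {
            pending = 1;
        } else if (ops[i] == OP_ESP_EXPLICIT && pending) {
            syncs++;     /* back end needs the up-to-date ESP value */
            pending = 0; /* offset has been folded in */
        }
    }
    return syncs;
}
```

In this model, a call followed by several [esp+disp8] accesses pays for one sync, while interleaving pushes with explicit ESP references pays once per interleaving.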


The traditional stack-frame setup with EBP creates a linked list of stack frames (each saved EBP points at the parent's saved EBP, right below a return address). This is handy for profiling, and sometimes for debugging if your code doesn't have alternate metadata that lets your debugger unwind the stack to show backtraces.
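As a rough illustration of why that linked list is useful, the sketch below walks a simulated stack rather than the real one. The stack is modeled as an array of 4-byte slots, `walk_frames` is a hypothetical name, and ending the chain at a saved EBP of 0 is an assumption of this sketch rather than something every ABI guarantees.

```c
#include <stddef.h>
#include <stdint.h>

/* Walk a chain of frames in simulated stack memory (indexed in 4-byte slots).
 * stack[ebp] holds the parent's saved EBP; stack[ebp + 1], the next higher
 * slot, holds the return address. Collects up to max return addresses and
 * returns how many were found. */
size_t walk_frames(const uint32_t *stack, uint32_t ebp,
                   uint32_t *ret_addrs, size_t max)
{
    size_t n = 0;
    while (ebp != 0 && n < max) {
        ret_addrs[n++] = stack[ebp + 1]; /* return address sits just above saved EBP */
        ebp = stack[ebp];                /* follow the link to the parent frame */
    }
    return n;
}
```

A profiler or debugger doing this on a real process needs nothing but the current EBP, which is why frame pointers are convenient when no other unwind metadata is available.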


Despite these downsides of using ESP, using EBP as a frame pointer is often not better for performance, because it ties up one more of the 8 general-purpose registers for the stack, leaving you with 6 instead of 7 that you can actually use for things other than the stack. Modern compilers default to -fomit-frame-pointer when optimization is enabled.

It's easy for compilers to keep track of how much ESP has moved relative to where they stored something because they know how much sub esp,28 moves the stack pointer. Even after pushing a function arg, they still know the right ESP-relative offset to anything they stored on the stack earlier in the function.
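The bookkeeping involved is small. Here is a minimal sketch of it, assuming 4-byte pushes and describing each stack slot by its offset from the value ESP had at function entry (negative for locals, positive for stack args). The type `esp_tracker` and the function names are invented for this example and do not correspond to any compiler's internals.

```c
/* Tracks how many bytes ESP has moved down since function entry. */
typedef struct { int depth; } esp_tracker;

void track_sub_esp(esp_tracker *t, int bytes) { t->depth += bytes; } /* sub esp, bytes */
void track_push(esp_tracker *t)               { t->depth += 4; }     /* push r32 */
void track_pop(esp_tracker *t)                { t->depth -= 4; }     /* pop r32 */

/* Displacement to use in [esp + disp] right now for a slot whose address is
 * entry_offset bytes away from the entry value of ESP. */
int esp_disp(const esp_tracker *t, int entry_offset)
{
    return entry_offset + t->depth;
}
```

For instance, after sub esp, 28 the lowest local (entry offset -28) is at [esp+0] and the first stack arg (entry offset +4) is at [esp+32]; after one more push they are at [esp+4] and [esp+36].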

Humans can do it, too, but it's easy to make a mistake when you modify the function to reserve some extra space and forget to update all the offsets from ESP to your locals and stack args, if any. (Normally it's not worth hand-writing large functions that can't keep most of their variables in registers, though. Leave that to the compiler and only spend your time writing the hot loops in asm, if at all.)

The exception is a function that allocates a variable amount of stack space (like C alloca or C99 variable-length arrays like int arr[n]); in that case compilers will make a traditional stack frame with EBP. The same applies in hand-written asm if you push in a loop to use the call stack as a stack data structure.
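A function like the following is the kind of case meant here. It is only an illustrative sketch (the name `sum_squares` is made up), and it needs a compiler that supports C99 variable-length arrays such as GCC or Clang; MSVC does not accept VLAs.

```c
/* The size of arr is only known at run time, so the amount ESP moves is not
 * a compile-time constant and fixed ESP-relative offsets no longer work. */
int sum_squares(int n)
{
    int arr[n];                 /* C99 VLA; n must be greater than 0 */
    for (int i = 0; i < n; i++)
        arr[i] = i * i;

    int total = 0;
    for (int i = 0; i < n; i++)
        total += arr[i];
    return total;
}
```

Compiling this for 32-bit x86 and looking at the asm is a quick way to check whether your compiler sets up EBP for it. An optimizer may also remove the array entirely in a function this simple, so the result depends on compiler and flags.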


For example, x86 MSVC 19.14 compiles this C:

int foo() {
    volatile int i = 0;  // force it to be stored to memory
    return i;
}

into this MASM asm (see it yourself on the Godbolt compiler explorer):

;;; MSVC -O2
_i$ = -4                                                ; size = 4
int foo(void) PROC                                        ; foo, COMDAT
        push    ecx
        mov     DWORD PTR _i$[esp+4], 0           ; note this is actually [esp+0] ; _i$ = -4
        mov     eax, DWORD PTR _i$[esp+4]
        pop     ecx
        ret     0
int foo(void) ENDP                                        ; foo

Notice that it reserves space for i with a push instead of sub esp, 4 because that saves code-size and is usually about the same performance. It's the same number of uops for the front-end, with no extra stack-sync uops, because the push is before any explicit reference to esp, and the pop is after the last one.

(If it was reserving more than 4 bytes, I think it would just use a normal sub esp, 8 or whatever.)

There's an obvious missed optimization here: push 0 would store the value it actually wants, instead of whatever garbage was in ECX, and pop eax would clean up the stack and load i as the return value. (See "What C/C++ compiler can use push pop instructions for creating local variables, instead of just increasing esp once?")

Compare that with the output when optimization is disabled. Notice that _i$ = -4 is the same offset from the "stack frame", but the optimized code used esp+4 as the base while this uses ebp. That's mostly just a fun fact about MSVC internals: it seems to think in terms of where EBP would be if it hadn't optimized away frame-pointer creation. Picking a reference point makes sense, and lining up with its frame-pointer-enabled choice is the obvious one.

;;; MSVC -O0
_i$ = -4                                                ; size = 4
int foo(void) PROC                                        ; foo
        push    ebp
        mov     ebp, esp                     ; make a stack frame
        push    ecx
        mov     DWORD PTR _i$[ebp], 0
        mov     eax, DWORD PTR _i$[ebp]
        mov     esp, ebp
        pop     ebp
        ret     0
int foo(void) ENDP                                        ; foo

Interestingly, it still uses push ecx to reserve 4 bytes of stack space (releasing it with mov esp, ebp rather than a pop). This time it does cause one extra stack-sync uop on Intel CPUs, because the push ecx after the mov ebp, esp re-dirties the stack engine before mov esp, ebp. But that's pretty trivial.