可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

I need to create a function that swaps the value of &x with the value of &y (meaning swap *(&y) and *(&x).

Swap:

    push EBP
    mov EBP,ESP
    mov EBX, [EBP+12] ; ebx = *x
    mov EAX, DWORD [EBX] ;eax = ebx = *x
    mov DWORD [EBP-4], EAX ; [ebp-4] = eax =*x
    mov EDX, [EBP+8] ; edx = *y
    mov EAX, DWORD [EDX] ; eax = *edx = *y
    mov DWORD [EBX], EAX ; ebx = eax = *y
    mov EAX, DWORD [EBP-4] ; eax = *x
    mov DWORD [EDX], EAX ; edx = *x
    pop EBP ; ebx = *y and edx = *x
    ret

I call it like this:

    // call Swap
    push x
    push y
    call swap

I don't understand why it's not working. I added comments that explain my understanding of it. What's wrong with my implementation? How can I fix it?

回答1:

You don't actually reserve memory on the stack that you use when you access a dword at [EBP-4]. It can get overwritten by things like interrupt routines, signal handlers, asynchronously called procedures, whatever applies in your OS.

The code should be like this instead:

swap:
    push  EBP
    mov   EBP,ESP           ; make a traditional stack frame

    sub   ESP, 4         ; reserve memory for a local variable at [EBP-4]

    mov   EBX, [EBP+12]        ; ebx = &x
    mov   EAX, DWORD [EBX]     ; eax = x
    mov   DWORD [EBP-4], EAX   ; [ebp-4] = eax = x
    mov   EDX, [EBP+8]         ; edx = &y
    mov   EAX, DWORD [EDX]     ; eax = y
    mov   DWORD [EBX], EAX     ; *&x = y
    mov   EAX, DWORD [EBP-4]   ; eax = x reloaded from the local
    mov   DWORD [EDX], EAX     ; *&y = x

    leave          ; remove locals (by restoring ESP), restore EBP

    ret

Also, make sure that you're passing as parameters the addresses of the variables x and y, not the values of the variables. push x+push y will pass the addresses of x and y in NASM but they will pass values of x and y in TASM and MASM.

回答2:

Aside from Alexey's bugfix, you could make this significantly more efficient. (Of course inlining the swap and optimizing at the call site is even better.)

There's no need for a local temporary on the stack: you could instead reload one of the addresses twice, or save/restore ESI and use it as a temporary.

You're actually destroying EBX, which is call-preserved in all the normal C calling conventions. In most 32-bit x86 calling conventions, EAX, ECX, and EDX are the three call-clobbered registers you can use without saving/restoring, while the others are call-preserved. (So i.e. your caller expects you not to destroy their values, so you can only use them if you put back the original value. This is why EBP has to be restored after you use it for a frame pointer.)

What gcc -O3 -m32 does when compiling a stand-alone (not inlined) definition for a swap function is save/restore EBX so it has 4 registers to play with. clang chooses ESI.

void swap(int *px, int *py) {
    int tmp = *px;
    *px = *py;
    *py = tmp;
}

On the Godbolt compiler explorer:

# gcc8.2 -O3 -m32 -fverbose-asm
# gcc itself emitted the comments on the following instructions
swap:
        push    ebx     #
        mov     edx, DWORD PTR [esp+8]    # px, px
        mov     eax, DWORD PTR [esp+12]   # py, py
        mov     ecx, DWORD PTR [edx]      # tmp, *px_3(D)
        mov     ebx, DWORD PTR [eax]      # tmp91, *py_5(D)
        mov     DWORD PTR [edx], ebx      # *px_3(D), tmp91
        mov     DWORD PTR [eax], ecx      # *py_5(D), tmp
        pop     ebx       #
        ret  

# DWORD PTR is the gas .intel_syntax equivalent of NASM's DWORD
# you can just remove them all because the register implies an operand size

It also avoids making a legacy stack-frame. You can add -fno-omit-frame-pointer to the compiler options to see code-gen with a frame pointer, if you want. (Godbolt will recompile and show you the asm. Very handy site for exploring compiler options and code changes.)

64-bit calling conventions already have args in registers, and have enough scratch regs so we just get 4 instructions, much more efficient.

As I mentioned, another option is to reload one of the pointer args twice:

swap:
       # without a push, offsets relative to ESP are smaller by 4
        mov     edx, [esp+4]    # edx = px   reused later
        mov     eax, [esp+8]    # eax = py   also reused later
        mov     ecx, [edx]      # ecx = tmp = *px   lives for the whole function

        mov     eax, [eax]      # eax = *py   destroying our register copy of py
        mov    [edx], eax       # *px = *py;  done with px, can now destroy it

        mov     edx, [esp+8]   # edx = py
        mov    [edx], ecx       # *py = tmp;
        ret

Only 7 instructions instead of 8. Loading the same value twice is very cheap, and out-of-order execution means it's not a problem to have the store address ready quickly even though in program order it's only the instruction right before the store that loads the address.