I need to create a function that swaps the value of &x with the value of &y (meaning swap *(&y) and *(&x).
Swap:
push EBP
mov EBP,ESP
mov EBX, [EBP+12] ; ebx = *x
mov EAX, DWORD [EBX] ;eax = ebx = *x
mov DWORD [EBP-4], EAX ; [ebp-4] = eax =*x
mov EDX, [EBP+8] ; edx = *y
mov EAX, DWORD [EDX] ; eax = *edx = *y
mov DWORD [EBX], EAX ; ebx = eax = *y
mov EAX, DWORD [EBP-4] ; eax = *x
mov DWORD [EDX], EAX ; edx = *x
pop EBP ; ebx = *y and edx = *x
ret
I call it like this:
// call Swap
push x
push y
call swap
I don't understand why it's not working. I added comments that explain my understanding of it. What's wrong with my implementation? How can I fix it?
You don't actually reserve memory on the stack that you use when you access a dword at [EBP-4]. It can get overwritten by things like interrupt routines, signal handlers, asynchronously called procedures, whatever applies in your OS.
The code should be like this instead:
swap:
push EBP
mov EBP,ESP ; make a traditional stack frame
sub ESP, 4 ; reserve memory for a local variable at [EBP-4]
mov EBX, [EBP+12] ; ebx = &x
mov EAX, DWORD [EBX] ; eax = x
mov DWORD [EBP-4], EAX ; [ebp-4] = eax = x
mov EDX, [EBP+8] ; edx = &y
mov EAX, DWORD [EDX] ; eax = y
mov DWORD [EBX], EAX ; *&x = y
mov EAX, DWORD [EBP-4] ; eax = x reloaded from the local
mov DWORD [EDX], EAX ; *&y = x
leave ; remove locals (by restoring ESP), restore EBP
ret
Also, make sure that you're passing as parameters the addresses of the variables x
and y
, not the values of the variables. push x
+push y
will pass the addresses of x
and y
in NASM but they will pass values of x
and y
in TASM and MASM.
Aside from Alexey's bugfix, you could make this significantly more efficient. (Of course inlining the swap and optimizing at the call site is even better.)
There's no need for a local temporary on the stack: you could instead reload one of the addresses twice, or save/restore ESI and use it as a temporary.
You're actually destroying EBX, which is call-preserved in all the normal C calling conventions. In most 32-bit x86 calling conventions, EAX, ECX, and EDX are the three call-clobbered registers you can use without saving/restoring, while the others are call-preserved. (So i.e. your caller expects you not to destroy their values, so you can only use them if you put back the original value. This is why EBP has to be restored after you use it for a frame pointer.)
What gcc -O3 -m32
does when compiling a stand-alone (not inlined) definition for a swap function is save/restore EBX so it has 4 registers to play with. clang chooses ESI.
void swap(int *px, int *py) {
int tmp = *px;
*px = *py;
*py = tmp;
}
On the Godbolt compiler explorer:
# gcc8.2 -O3 -m32 -fverbose-asm
# gcc itself emitted the comments on the following instructions
swap:
push ebx #
mov edx, DWORD PTR [esp+8] # px, px
mov eax, DWORD PTR [esp+12] # py, py
mov ecx, DWORD PTR [edx] # tmp, *px_3(D)
mov ebx, DWORD PTR [eax] # tmp91, *py_5(D)
mov DWORD PTR [edx], ebx # *px_3(D), tmp91
mov DWORD PTR [eax], ecx # *py_5(D), tmp
pop ebx #
ret
# DWORD PTR is the gas .intel_syntax equivalent of NASM's DWORD
# you can just remove them all because the register implies an operand size
It also avoids making a legacy stack-frame. You can add -fno-omit-frame-pointer
to the compiler options to see code-gen with a frame pointer, if you want. (Godbolt will recompile and show you the asm. Very handy site for exploring compiler options and code changes.)
64-bit calling conventions already have args in registers, and have enough scratch regs so we just get 4 instructions, much more efficient.
As I mentioned, another option is to reload one of the pointer args twice:
swap:
# without a push, offsets relative to ESP are smaller by 4
mov edx, [esp+4] # edx = px reused later
mov eax, [esp+8] # eax = py also reused later
mov ecx, [edx] # ecx = tmp = *px lives for the whole function
mov eax, [eax] # eax = *py destroying our register copy of py
mov [edx], eax # *px = *py; done with px, can now destroy it
mov edx, [esp+8] # edx = py
mov [edx], ecx # *py = tmp;
ret
Only 7 instructions instead of 8. Loading the same value twice is very cheap, and out-of-order execution means it's not a problem to have the store address ready quickly even though in program order it's only the instruction right before the store that loads the address.