More efficient assembly code?

Posted 2019-06-06 03:35

I've recently begun studying assembly. Just wondering why this assembly is written the way it is instead of the alternative "My Assembly" I list below, which cuts out one instruction. Any ideas? Is the case where this works just too rare to be worth handling? It just seems wasteful to me to move the value 3 into eax first.

C code:

#include<stdio.h>

int main()
{
   int a = 1;
   int b = 3;
   a = a+b;
   return a;
}

Assembly:

Dump of assembler code for function main:
0x080483dc <+0>:    push   ebp
0x080483dd <+1>:    mov    ebp,esp
0x080483df <+3>:    sub    esp,0x10
0x080483e2 <+6>:    mov    DWORD PTR [ebp-0x4],0x1
0x080483e9 <+13>:   mov    DWORD PTR [ebp-0x8],0x3
0x080483f0 <+20>:   mov    eax,DWORD PTR [ebp-0x8]
0x080483f3 <+23>:   add    DWORD PTR [ebp-0x4],eax
0x080483f6 <+26>:   mov    eax,DWORD PTR [ebp-0x4]
0x080483f9 <+29>:   leave  
0x080483fa <+30>:   ret   

"My Assembly":

Dump of assembler code for function main:
0x080483dc <+0>:    push   ebp
0x080483dd <+1>:    mov    ebp,esp
0x080483df <+3>:    sub    esp,0x10
0x080483e2 <+6>:    mov    DWORD PTR [ebp-0x4],0x1
0x080483e9 <+13>:   mov    DWORD PTR [ebp-0x8],0x3
0x080483f0 <+20>:   add    DWORD PTR [ebp-0x4],DWORD PTR [ebp-0x8]
0x080483f3 <+23>:   mov    eax,DWORD PTR [ebp-0x4]
0x080483f6 <+26>:   leave
0x080483f9 <+29>:   ret   

1 Answer
疯言疯语
Answered 2019-06-06 04:26

As Michael Petch has already said in a comment, the real answer is that you're looking at unoptimized code. Compilers do all sorts of…well, inefficient things in unoptimized code.

Sometimes they do it for the sake of compilation speed. Optimizing takes longer than blindly translating the C code into assembly instructions, so when you want the fastest possible compile, you turn off the optimizer and let the compiler act as a relatively simple-minded instruction translator.

Another reason that compilers do inefficient things in unoptimized code is to make debugging easier. For example, your IDE probably lets you set a breakpoint on each individual line of your C/C++ code. If the optimizer had turned multiple C/C++ lines into a single assembly instruction, it would be much more difficult, if not impossible, to set the breakpoints you wanted to set. This is why debugging optimized code is much more difficult, and often requires dropping down to the raw assembly and doing address-level debugging.
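If you want to see the difference for yourself, you can ask the compiler to emit assembly at each setting. The commands below are just a sketch, assuming a GCC toolchain targeting 32-bit x86 and a hypothetical source file named example.c; other compilers have equivalent switches.

# No optimization (-O0): fast compiles and easy debugging, but naive code like the first dump above.
gcc -m32 -O0 -S -masm=intel example.c -o unoptimized.s

# Optimization enabled (-O2): the redundant loads and the stack frame largely disappear.
gcc -m32 -O2 -S -masm=intel example.c -o optimized.s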

There are two dead giveaways here that tell you this is unoptimized code:

  1. The use of the leave instruction, which is essentially a historical relic of the x86's CISC days. The philosophy used to be to have a bunch of instructions that did complicated things: the enter instruction was used at the beginning of a function to set up the stack frame, and the leave instruction brought up the rear, tearing down the stack frame. This made the programmer's job easier in block-structured languages, because you only needed to write a single instruction to accomplish multiple tasks. The problem is that, since at least the 386 (and possibly the 286), the enter instruction has been substantially slower than doing the same thing with simpler, separate instructions. leave is also slower on the 386 and later, and is only useful when you're optimizing for size over speed (since it is smaller and not quite as slow as enter). A sketch of what enter and leave expand to follows this list.

  2. The fact that a stack frame is being set up at all! Whenever optimization is enabled, a compiler targeting 32-bit x86 generally won't bother to generate prologue code that sets up a stack frame. That is, it won't save the original value of the EBP register, and it won't point EBP at the stack pointer (ESP) on function entry. Instead, it performs the "frame pointer omission" optimization (the EBP register is called the "frame pointer"): rather than using EBP-relative offsets to access the stack, it just uses ESP-relative offsets. This wasn't possible in 16-bit x86 code, where SP cannot be used as a base register in addressing modes, but it works fine in 32-bit code. It just takes more bookkeeping, since the stack pointer can change throughout the function, whereas the frame pointer would have stayed constant. That bookkeeping isn't nearly the problem for a computer/compiler that it would be for a human, so this is an obvious optimization.
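To make the first point concrete, here is roughly what those two "complicated" instructions expand to in terms of simpler operations (a sketch; enter also takes a second, nesting-level operand, which is almost always 0):

; 'enter 0x10, 0' does essentially the same work as:
push ebp            ; save the caller's frame pointer
mov  ebp, esp       ; establish our own frame pointer
sub  esp, 0x10      ; reserve 16 bytes for locals

; 'leave' is equivalent to:
mov  esp, ebp       ; free the locals by restoring the stack pointer
pop  ebp            ; restore the caller's frame pointer

The second point is illustrated by the "first-blush" version below, which addresses the locals relative to ESP and never touches EBP at all.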

Another problem with "your" assembly is that you've used an invalid instruction. There is no instruction* in the x86 architecture that accepts two memory operands. At most, one of the operands can be a memory location. The other operand must either be a register or an immediate.

The first-blush "optimized" version of this code would be something like:

; Allocate 8 bytes of space on the stack for our local variables, 'a' and 'b'.
sub  esp, 8

; Load the values of 'a' and 'b', storing them into the allocated locations.
; (Note the use of ESP-relative offsets, rather than EBP-relative offsets.)
mov  DWORD PTR [esp],     1
mov  DWORD PTR [esp + 4], 3

; Load the value of 'a' into a register (EAX), and add 'b' to it.
; (Necessary because we can't do an ADD with two memory operands.)
mov  eax, DWORD PTR [esp]
add  eax, DWORD PTR [esp + 4]

; The result is now in EAX, which is exactly where we want it to be.
; (All x86 calling conventions return integer-sized values in EAX.)

; Clean up the stack, and return.
add  esp, 8
ret

We've "optimized" the stack initialization sequence, and lost a lot of the fluff. Things look pretty good now. In fact, this is essentially the code that a compiler will generate if you were to declare the a and b variables volatile. However, they're not actually volatile in the original code, which means that we can keep them entirely in registers. This frees us from having to do any costly memory stores/loads, and means we don't have to allocate or restore stack space at all!

; Load the 'a' and 'b' values into the EAX and EDX registers, respectively.
mov  eax, 1
mov  edx, 3

; Add 'b' to 'a' in a single operation, since ADD works fine with
; two register operands.
add  eax, edx

; Return, with result in EAX.
ret

Neat, right? This not only simplifies the code, but is actually a big performance win, since we're keeping everything in registers and never have to touch slow memory. Well, what else can we do? Remember that the ADD instruction allows us to use a register as the destination operand and an immediate as the source operand. That means we could skip a MOV and just do:

mov  eax, 1
add  eax, 3
ret

This is similar to what you would expect to see if you were, say, adding a constant 3 to a value already in memory:

add  DWORD PTR [esp + 4], 3

But in this case, an optimizing compiler would never do it that way. It will actually outsmart you: it notices that you're adding two compile-time constants and folds the addition at compile time. Thus, the actual output of a compiler (and indeed, the most efficient way to write this code) would be simply:

mov  eax, 4
ret

How anti-climactic. :-) The fastest code is always the code that doesn't have to execute.


*At least, none that I can think of at the moment. The x86 ISA is colossal, so it's almost inevitable that there's some dark corner of it, one I'm not thinking of, where this statement is false. But it's true enough that you can rely on it as an axiom.
