I am new to x86_64 assembly programming. I was writing a simple "Hello World" program in x86_64 assembly. Below is my code, which runs perfectly fine.
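global _start

section .data
    msg:  db "Hello to the world of SLAE64", 0x0a
    mlen  equ $-msg          ; length of msg in bytes

section .text
_start:
    mov rax, 1               ; syscall number: write
    mov rdi, 1               ; fd 1 = stdout
    mov rsi, msg             ; buffer address
    mov rdx, mlen            ; byte count
    syscall
    mov rax, 60              ; syscall number: exit
    mov rdi, 4               ; exit status
    syscall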
Now when I disassemble it in gdb, I get the output below:
(gdb) disas
Dump of assembler code for function _start:
=> 0x00000000004000b0 <+0>: mov eax,0x1
0x00000000004000b5 <+5>: mov edi,0x1
0x00000000004000ba <+10>: movabs rsi,0x6000d8
0x00000000004000c4 <+20>: mov edx,0x1d
0x00000000004000c9 <+25>: syscall
0x00000000004000cb <+27>: mov eax,0x3c
0x00000000004000d0 <+32>: mov edi,0x4
0x00000000004000d5 <+37>: syscall
End of assembler dump.
My question is: why does NASM behave this way? I know it can change the encoding of an instruction based on the opcode, but I did not expect the same behaviour with registers.
Also, does this behaviour affect the functionality of the executable?
I am using Ubuntu 16.04 (64 bit) installed in VMware on i5 processor.
Thank you in advance.
In 64-bit mode, mov eax, 1 will clear the upper part of the rax register (see here for an explanation), thus mov eax, 1 is semantically equivalent to mov rax, 1.
The former, however, saves a REX.W prefix (48h numerically), the prefix byte that selects 64-bit operand size (REX prefixes are also what encode the registers introduced with x86-64). The opcode is the same for both instructions (0B8h followed by a DWORD or a QWORD).
So the assembler goes ahead and picks the shortest form.
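To see why the two forms are interchangeable, here is a minimal sketch meant to be single-stepped in gdb. It assumes Linux x86-64 and a build with `nasm -f elf64` followed by `ld`; the register values in the comments follow from the architectural rule that a 32-bit write zeroes the upper half, while 8-bit and 16-bit writes do not.

```nasm
; Sketch only: watch rax with "info registers rax" while stepping.
global _start

section .text
_start:
    mov rax, -1         ; rax = 0xFFFFFFFFFFFFFFFF
    mov eax, 1          ; 32-bit write: rax = 0x0000000000000001
    mov rax, -1         ; rax = 0xFFFFFFFFFFFFFFFF again
    mov al, 1           ; 8-bit write: rax = 0xFFFFFFFFFFFFFF01, upper bits kept

    mov eax, 60         ; exit
    xor edi, edi        ; status 0
    syscall
```

The second instruction is the case NASM relies on: after it, nothing of the old rax value survives, so it behaves exactly like mov rax, 1.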
This is typical behavior for NASM; see Section 3.3 of the NASM manual, where the example [eax*2] is assembled as [eax+eax] to save the disp32 field after the SIB byte [1] ([eax*2] is only encodable as [eax*2+disp32], where the assembler sets disp32 to 0).
I was unable to force NASM to emit a real mov rax, 1 instruction (i.e. 48 B8 01 00 00 00 00 00 00 00), even by prefixing the instruction with o64.
If a real mov rax, 1 is needed (this is not your case), one must resort to assembling it manually with db and similar.
EDIT: Peter Cordes' answer shows that there is, in fact, a way to tell NASM not to optimize an instruction: the strict modifier.
mov rax, STRICT 1 produces the 10-byte version of the instruction (mov r64, imm64), while mov rax, STRICT DWORD 1 produces a 7-byte version (mov r64, imm32, where imm32 is sign-extended before use).
Side note: It's better to use RIP-relative addressing. This avoids 64-bit immediate constants (thus reducing code size) and is mandatory on MacOS (in case you care).
Change the mov rsi, msg to lea rsi, [REL msg]. RIP-relative is an addressing mode, so it needs a memory operand (the square brackets); to avoid actually reading from that address we use lea, which only computes the effective address and performs no memory access.
You can use the directive DEFAULT REL to avoid typing REL in each memory access.
I was under the impression that the Mach-O file format required PIC code but this may not be the case.
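As a sketch of that suggestion applied to the program from the question (assuming Linux x86-64, built with `nasm -f elf64 hello.asm && ld -o hello hello.o`; the file name is only an example):

```nasm
default rel                  ; make [msg] mean [rel msg] everywhere

global _start

section .data
    msg:  db "Hello to the world of SLAE64", 0x0a
    mlen  equ $-msg

section .text
_start:
    mov eax, 1               ; write (32-bit write zero-extends into rax)
    mov edi, 1               ; stdout
    lea rsi, [msg]           ; RIP-relative address of msg, no imm64 needed
    mov edx, mlen
    syscall
    mov eax, 60              ; exit
    mov edi, 4               ; same exit status as the original program
    syscall
```

The lea here encodes a 32-bit displacement from RIP instead of a 64-bit absolute address, which is what makes it shorter than the movabs seen in the disassembly.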
[1] The Scale-Index-Base byte, used to encode the new addressing modes that were introduced with 32-bit mode.
This is a perfectly safe and useful optimization, very similar to using an 8-bit immediate instead of a 32-bit immediate when you write add eax, 1.
NASM only optimizes when the shorter form of the instruction has an identical architectural effect. That is the case here, because mov eax,1 implicitly zeros the upper 32 bits of RAX.
But note that YASM doesn't do it, so it's a good idea to make the optimization yourself in the asm source, if you care about code-size (even indirectly for performance reasons).
For instructions where 32 and 64-bit operand size wouldn't be equivalent if you had very large (or negative) numbers, you need to use 32-bit operand-size explicitly even if you're assembling with NASM instead of YASM, if you want the size / performance advantage of 32-bit operand-size.
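A small sketch of a case where the two operand sizes are not interchangeable, which is why no assembler can make this choice for you. It assumes Linux x86-64 and is meant to be stepped through in gdb; the values in the comments follow from ordinary 32-bit wraparound.

```nasm
global _start

section .text
_start:
    mov eax, 0xFFFFFFFF     ; rax = 0x00000000FFFFFFFF
    add rax, 1              ; 64-bit add: rax = 0x0000000100000000

    mov eax, 0xFFFFFFFF     ; rax = 0x00000000FFFFFFFF
    add eax, 1              ; 32-bit add wraps: rax = 0x0000000000000000

    mov eax, 60             ; exit
    xor edi, edi            ; status 0
    syscall
```

If you know the value always fits in 32 bits, writing add eax, 1 yourself saves the REX prefix; the assembler has no way to know that.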
See also: The advantages of using 32bit registers/instructions in x86-64
For 32-bit constants that don't have their high bit set, zero or sign extending them to 64 bits gives an identical result. Thus it's a pure optimization to assemble mov rax, 1 to a 5-byte mov r32, imm32 (with implicit zero extension to 64 bits) instead of a 7-byte mov r/m64, sign_extended_imm32.
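To make the "high bit set" condition concrete, here is a sketch (assuming NASM on Linux x86-64, stepped in gdb) where the same 32-bit immediate pattern 80000000h gives different 64-bit results depending on which encoding extends it:

```nasm
global _start

section .text
_start:
    ; High bit clear: both encodings leave the same value in rax.
    mov eax, 0x7FFFFFFF                 ; rax = 0x000000007FFFFFFF
    mov rax, strict dword 0x7FFFFFFF    ; rax = 0x000000007FFFFFFF

    ; High bit set: the imm32 field holds 80000000h in both cases.
    mov eax, 0x80000000                 ; zero-extended: rax = 0x0000000080000000
    mov rax, strict dword -0x80000000   ; sign-extended: rax = 0xFFFFFFFF80000000

    mov eax, 60                         ; exit
    xor edi, edi                        ; status 0
    syscall
```

Only the first pair is interchangeable, which is the situation mov rax, 1 falls into.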
On all current x86 CPUs, the only performance difference between that and the 7-byte encoding is code-size, so only indirect effects like alignment and L1I$ pressure are a factor. Internally it's just a mov-immediate, so this optimization doesn't change the microarchitectural effect of your code either (except of course for code-size / alignment / how it packs in the uop cache).
The 10-byte mov r64, imm64 encoding is even worse for code size. If the constant actually has any of its high bits set, then it has extra inefficiency in the uop cache on Intel Sandybridge-family CPUs (using 2 entries in the uop cache, and maybe an extra cycle to read from the uop cache). But if the constant is in the -2^31 .. 2^31-1 range (signed 32-bit), it's stored internally just as efficiently, using only a single uop-cache entry, even if it was encoded in the x86 machine code using a 64-bit immediate. (See Agner Fog's microarch doc, Table 9.1, "Size of different instructions in μop cache", in the Sandybridge section.)
From How many ways to set a register to zero?, you can force any of the three encodings with NASM:
mov eax, 1 ; 5 bytes to encode (B8 imm32)
mov rax, strict dword 1 ; 7 bytes: REX mov r/m64, sign-extended-imm32. NASM optimizes mov rax,1 to the 5B version, but dword or strict dword stops it for some reason
mov rax, strict qword 1 ; 10 bytes to encode (REX B8 imm64). movabs mnemonic for AT&T. Normally assemblers choose smaller encodings if the operand fits, but strict qword forces the imm64.
Note that NASM used the 10-byte encoding (which AT&T syntax calls movabs, and so does objdump in Intel-syntax mode) for an address which is a link-time constant but unknown at assemble time.
YASM chooses mov r64, imm32, i.e. it assumes a code-model where label addresses are 32 bits, unless you use mov rsi, strict qword msg.
YASM's behaviour is normally good (although using mov r32, imm32 for static absolute addresses like C compilers do would be even better). The default non-PIC code-model puts all static code/data in the low 2GiB of virtual address space, so zero- or sign-extended 32-bit constants can hold addresses.
If you want 64-bit label addresses you should normally use lea r64, [rel address] to do a RIP-relative LEA. (On Linux at least, position-dependent code can go in the low 32, so unless you're using the large / huge code models, any time you need to care about 64-bit label addresses, you're also making PIC code where you should use RIP-relative LEA to avoid needing text relocations of absolute address constants).
i.e. gcc and other compilers would have used mov esi, msg, or lea rsi, [rel msg], never mov rsi, msg.
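Putting the three ways of loading the address side by side, as a sketch for NASM on Linux x86-64 linked as a non-PIE static executable with plain `ld` (the first form is expected to fail at link time in a PIE build, since it needs a 32-bit absolute relocation):

```nasm
global _start

section .data
    msg:  db "Hello to the world of SLAE64", 0x0a
    mlen  equ $-msg

section .text
_start:
    mov esi, msg            ; 5 bytes: imm32 absolute address, zero-extended
    lea rsi, [rel msg]      ; 7 bytes: RIP-relative, position-independent
    mov rsi, msg            ; 10 bytes with NASM: imm64 (shown as movabs)

    ; All three leave the same address in rsi, so one write is enough.
    mov eax, 1              ; write
    mov edi, 1              ; stdout
    mov edx, mlen
    syscall
    mov eax, 60             ; exit
    xor edi, edi            ; status 0
    syscall
```

Disassembling this with gdb or objdump should show the three lengths directly in the instruction offsets.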