This is probably a matter of nano- rather than micro-optimization, but the subject interests me: are there any penalties for using non-native register sizes in long mode?
I've learned from various sources that partial register updates (like ax instead of eax) can cause eflags stall and degrade performance. But I'm not sure about long mode. Which register size is considered native in this processor operating mode? x86-64 is still an extension of the x86 architecture, thus I believe 32 bits are still native. Or am I wrong?
For example, instructions like
sub eax, r14d
or
sub rax, r14
have the same size, but are there any penalties for using either of them?
Are there any penalties for mixing register sizes in consecutive instructions like the ones below? (Assume the high dword is zero in all cases.)
sub ecx, eax
sub r14, rax
May there be any penalties when mixing 32 and 64-bit register sizes in consecutive instructions?
No, writing to a 32-bit register always zero-extends into the full 64-bit register, so x86-64 avoids any partial-register penalties for 32-bit and 64-bit instructions.
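To make the zero-extension rule concrete, here is a minimal Python sketch of the architectural write semantics. The helper write_reg is hypothetical (it is not from any library), it models only low-8, low-16, 32-bit and 64-bit writes (not ah), and it says nothing about timing. It only shows which bits of the old value survive a write, which is what creates (or avoids) a dependency on the old register.

```python
MASK64 = (1 << 64) - 1

def write_reg(old, value, width):
    """Return the 64-bit register contents after a write of `width` bits.

    Hypothetical model of x86-64 semantics, for illustration only.
    """
    if width == 64:
        return value & MASK64
    if width == 32:
        # 32-bit writes zero-extend: nothing of the old value survives.
        return value & 0xFFFFFFFF
    if width in (8, 16):
        # Low 8/16-bit writes merge: the upper bits of the old value are kept.
        mask = (1 << width) - 1
        return (old & ~mask & MASK64) | (value & mask)
    raise ValueError("width must be 8, 16, 32 or 64")

rax = 0x1122334455667788
print(hex(write_reg(rax, 0xAABBCCDD, 32)))  # mov eax, 0xAABBCCDD -> 0xaabbccdd
print(hex(write_reg(rax, 0xBEEF, 16)))      # mov ax, 0xBEEF      -> 0x112233445566beef
print(hex(write_reg(rax, 0x00, 8)))         # mov al, 0           -> 0x1122334455667700
```

Only the 8-bit and 16-bit cases need the old value as an input, which is why those are the sizes where merging or false dependencies come up.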
thus I believe 32 bits are still native.
Yes, the default operand-size is 32-bit for most instructions (other than PUSH/POP). 64-bit needs a REX prefix with the W bit set to 1. So prefer 32-bit for code-size reasons. This is why compilers use mov r32, imm32 for addresses of static data in position-dependent (non-PIE) executables (since the default code-model requires that code and static data addresses are in the low 2GiB of virtual address space).
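As a rough illustration of the code-size point, the following Python sketch encodes register-to-register sub by hand. The function encode_sub_rr is hypothetical, and it always uses the 0x29 opcode form (SUB r/m, r); an assembler may pick the equivalent 0x2B form instead, with the same length. Register numbers follow the usual encoding (rax=0, rcx=1, r14=14).

```python
def encode_sub_rr(dst, src, wide):
    """Encode `sub dst, src` for 32-bit (wide=False) or 64-bit (wide=True) registers.

    Hypothetical helper for illustration; dst and src are register numbers 0..15.
    """
    rex = 0x40
    if wide:
        rex |= 0x08          # REX.W: 64-bit operand size
    if src >= 8:
        rex |= 0x04          # REX.R: extends the ModRM.reg field
    if dst >= 8:
        rex |= 0x01          # REX.B: extends the ModRM.rm field
    modrm = 0xC0 | ((src & 7) << 3) | (dst & 7)   # mod=11: register-direct
    # A bare 0x40 prefix is unnecessary for 32/64-bit register operands.
    prefix = bytes([rex]) if rex != 0x40 else b""
    return prefix + bytes([0x29, modrm])

RAX, RCX, R14 = 0, 1, 14
print(encode_sub_rr(RAX, R14, wide=False).hex())  # sub eax, r14d -> 4429f0
print(encode_sub_rr(RAX, R14, wide=True).hex())   # sub rax, r14  -> 4c29f0
print(encode_sub_rr(RCX, RAX, wide=False).hex())  # sub ecx, eax  -> 29c1
print(encode_sub_rr(R14, RAX, wide=True).hex())   # sub r14, rax  -> 4929c6
```

The two instructions from the question are the same size only because r14 already forces a REX prefix. With legacy registers alone, the 32-bit form is one byte shorter.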
It was a design choice by AMD. They could have chosen the other way, and required a prefix to get 32-bit operand size. Since long mode is a separate mode, x86-64 machine code can be different from x86-32 machine code however it wants. AMD chose to minimize the differences so they could share as many transistors as possible in the decoders. Your conclusion is correct, but your reasoning is totally bogus.
partial register updates (like ax instead of eax) can cause eflags stall and degrade performance.
Partial-flag stalls are separate from partial-register stalls. They're handled similarly internally (the separately-renamed parts of EFLAGS have to be merged the same as a modified AX has to be merged with the unmodified upper bytes of EAX). But one doesn't cause the other.
# partial-reg stall
setcc al # leaves the upper 3 (or 7) bytes unmodified
add edx, eax # reads full EAX. Older CPUs stall while merging
Zeroing EAX with xor eax,eax ahead of the flag-setting instruction and the setcc avoids the partial-register penalty entirely. (Core2/Nehalem stalls for fewer cycles than earlier CPUs, but does still stall for 2 or 3 cycles while inserting a merging uop. Sandybridge doesn't stall at all while inserting the merging uop.)
(Another summary of partial register penalties on different CPUs: Why doesn't GCC use partial registers?, saying basically the same thing).
AMD doesn't suffer from partial-register stalls when reading the full register later, but instead partial-register writes and reads have a false dependency on the full register. (AMD CPUs don't rename sub-registers separately in the first place. Intel P4 and Silvermont / Knight's Landing are the same way.)
Intel Haswell/Skylake (and maybe Ivybridge) don't rename al separately from rax at all, so they never need to merge low8 / low16 registers. But the setcc al has a false dependency on the old value. They do still rename and merge ah. (Details on HSW/SKL partial-reg performance.)
# partial flag stall when reading a flag that didn't come from
# the last instruction to write any flags.
clc
# edi and esi = one-past-the-end of dst and src
# ecx = -count
bigInt_add:
mov eax, [esi+ecx*4]
adc [edi+ecx*4], eax # reads CF, partial flag stall on 2nd and later iterations
inc ecx # writes all flags except CF
jl bigInt_add # loop upwards towards zero
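For readers less used to this idiom, here is a small Python model of what the loop computes. It is a sketch of the arithmetic only, not of the stall itself; the function name bigint_add and the 32-bit little-endian limb layout are assumptions made for illustration. The variable cf plays the role of CF: adc both reads and writes it, while the counter update (inc in the asm) leaves it alone, which is why the adc in the next iteration reads a flag that the most recent flag-writing instruction did not produce.

```python
def bigint_add(dst, src):
    """Add src into dst in place (lists of 32-bit limbs, least significant first).

    Returns the final carry. Hypothetical model of the asm loop above.
    """
    cf = 0                                  # clc
    # i runs from -count up to -1, like ecx indexing back from one-past-the-end.
    # (The asm loop is a do-while; this for loop also tolerates count == 0.)
    for i in range(-len(dst), 0):
        total = dst[i] + src[i] + cf        # adc reads CF
        dst[i] = total & 0xFFFFFFFF
        cf = total >> 32                    # adc writes CF
        # The counter increment (inc ecx) updates other flags but never cf.
    return cf

a = [0xFFFFFFFF, 0x00000000]
carry = bigint_add(a, [0x00000001, 0x00000000])
print([hex(x) for x in a], carry)  # ['0x0', '0x1'] 0
```

Replacing inc with add ecx, 1 in the asm would overwrite CF and break the carry chain, which is why the loop accepts the partial-flag behavior of inc.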
See this Q&A for more discussion about partial-flags issues on Intel pre-Sandybridge vs. Sandybridge.
See also Agner Fog's microarch pdf, and other links in the x86 tag wiki for more details about all of this.