This loop runs at one iteration per 3 cycles on Intel Conroe/Merom, bottlenecked on `imul` throughput as expected. But on Haswell/Skylake, it runs at one iteration per 11 cycles, apparently because `setnz al` has a dependency on the last `imul`.
```asm
; synthetic micro-benchmark to test partial-register renaming
    mov     ecx, 1000000000
.loop:                   ; do{
    imul    eax, eax     ; a dep chain with high latency but also high throughput
    imul    eax, eax
    imul    eax, eax
    dec     ecx          ; set ZF, independent of old ZF. (Use sub ecx,1 on Silvermont/KNL or P4)
    setnz   al           ; ****** Does this depend on RAX as well as ZF?
    movzx   eax, al
    jnz     .loop        ; }while(ecx);
```
If `setnz al` depends on `rax`, the 3x `imul` / `setcc` / `movzx` sequence forms a loop-carried dependency chain. If not, each `setcc` / `movzx` / 3x `imul` chain is independent, forked off from the `dec` that updates the loop counter. The 11c per iteration measured on HSW/SKL is perfectly explained by a latency bottleneck: 3x3c (imul) + 1c (read-modify-write by setcc) + 1c (movzx within the same register).
**Off topic: avoiding these (intentional) bottlenecks**
I was going for understandable / predictable behaviour to isolate partial-reg stuff, not optimal performance.
For example, `xor`-zero / set-flags / `setcc` is better anyway (in this case, `xor eax,eax` / `dec ecx` / `setnz al`). That breaks the dep on eax on all CPUs (except early P6-family like PII and PIII), still avoids partial-register merging penalties, and saves 1c of `movzx` latency. It also uses one fewer ALU uop on CPUs that handle xor-zeroing in the register-rename stage. See that link for more about using xor-zeroing with `setcc`.
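For concreteness, here's how the loop body looks with that change (a sketch following the instruction order described above, not measured output from the original test):

```asm
    mov     ecx, 1000000000
.loop:                   ; do{
    imul    eax, eax     ; this chain forks off each iteration's setnz result,
    imul    eax, eax     ; so it's no longer loop-carried
    imul    eax, eax
    xor     eax, eax     ; dep-breaking zeroing idiom, handled at rename on SnB-family
    dec     ecx          ; set ZF
    setnz   al           ; writes AL into the known-zeroed RAX: no movzx needed
    jnz     .loop        ; }while(ecx);
```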
Note that AMD, Intel Silvermont/KNL, and P4 don't do partial-register renaming at all. It's only a feature in Intel P6-family CPUs and their descendant, Intel Sandybridge-family, but it seems to be getting phased out.
gcc unfortunately does tend to use `cmp` / `setcc al` / `movzx eax,al` where it could have used `xor` instead of `movzx` (Godbolt compiler-explorer example), while clang uses xor-zero/cmp/setcc unless you combine multiple boolean conditions like `count += (a==b) | (a==~b)`.
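The difference between the two code-gen patterns looks like this (an illustrative sketch; register choices are mine and actual compiler output varies):

```asm
; gcc-style: setcc, then zero-extend (the movzx adds 1c to the critical path)
    cmp     edi, esi
    sete    al
    movzx   eax, al

; clang-style: xor-zero ahead of the flag-setting compare, so no movzx is needed
    xor     eax, eax
    cmp     edi, esi
    sete    al
```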
The xor/dec/setnz version runs at 3.0c per iteration on Skylake, Haswell, and Core2 (bottlenecked on `imul` throughput). `xor`-zeroing breaks the dependency on the old value of `eax` on all out-of-order CPUs other than PPro/PII/PIII/early-Pentium-M (where it still avoids partial-register merging penalties but doesn't break the dep). Agner Fog's microarch guide describes this. Replacing the xor-zeroing with `mov eax,0` slows it down to one per 4.78 cycles on Core2: a 2-3c stall (in the front-end?) to insert a partial-reg merging uop when `imul` reads `eax` after `setnz al`; a sketch of that variant follows.
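For reference, the slow Core2 variant differs only in the zeroing instruction (sketch):

```asm
.loop:
    imul    eax, eax     ; reading EAX after the partial AL write below
    imul    eax, eax     ; stalls 2-3c on Core2 to insert a merging uop
    imul    eax, eax
    mov     eax, 0       ; writes the full reg, but isn't recognized as a zeroing idiom
    dec     ecx
    setnz   al           ; partial-register write
    jnz     .loop
```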
Also, I used `movzx eax, al`, which defeats mov-elimination, just like `mov rax,rax` does. (IvB, HSW, and SKL can rename `movzx eax, bl` with 0 latency, but Core2 can't.) This makes everything equal across Core2 / SKL, except for the partial-register behaviour.
The Core2 behaviour is consistent with Agner Fog's microarch guide, but the HSW/SKL behaviour isn't. From section 11.10 for Skylake (and the same for previous Intel uarches):

> Different parts of a general purpose register can be stored in different temporary registers in order to remove false dependences.
He unfortunately doesn't have time to do detailed testing for every new uarch to re-test assumptions, so this change in behaviour slipped through the cracks.
Agner does describe a merging uop being inserted (without stalling) for high8 registers (AH/BH/CH/DH) on Sandybridge through Skylake, and for low8/low16 on SnB. (I've unfortunately been spreading mis-information in the past, and saying that Haswell can merge AH for free. I skimmed Agner's Haswell section too quickly, and didn't notice the later paragraph about high8 registers. Let me know if you see my wrong comments on other posts, so I can delete them or add a correction. I will try to at least find and edit my answers where I've said this.)
**My actual questions:** How exactly do partial registers really behave on Skylake? Is everything the same from IvyBridge to Skylake, including the high8 extra latency?
Intel's optimization manual is not specific about which CPUs have false dependencies for what (although it does mention that some CPUs have them), and leaves out things like reading AH/BH/CH/DH (high8 registers) adding extra latency even when they haven't been modified.
If there's any P6-family (Core2/Nehalem) behaviour that Agner Fog's microarch guide doesn't describe, that would be interesting too, but I should probably limit the scope of this question to just Skylake or Sandybridge-family.
**My Skylake test data**, from putting `%rep 4` short sequences inside a small `dec ebp/jnz` loop that runs 100M or 1G iterations. I measured cycles with Linux `perf` the same way as in my answer here, on the same hardware (desktop Skylake i7 6700k). Unless otherwise noted, each instruction runs as 1 fused-domain uop, using an ALU execution port. (Measured with `ocperf.py stat -e ...,uops_issued.any,uops_executed.thread`). This detects (absence of) mov-elimination and extra merging uops.
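A minimal sketch of that test harness (the framing and exit code here are my assumptions, not the exact harness used; `mov ah, bh` stands in for whatever sequence is under test):

```asm
global _start
_start:
    mov     ebp, 100000000      ; 100M iterations (or 1G for the shortest sequences)
.loop:
%rep 4
    mov     ah, bh              ; <-- the short sequence under test goes here
%endrep
    dec     ebp
    jnz     .loop

    xor     edi, edi            ; exit(0)
    mov     eax, 231            ; __NR_exit_group on x86-64 Linux
    syscall
```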
The "4 per cycle" cases are an extrapolation to the infinitely-unrolled case. Loop overhead takes up some of the front-end bandwidth, but anything better than 1 per cycle is an indication that register-renaming avoided the write-after-write output dependency, and that the uop isn't handled internally as a read-modify-write.
**Writing to AH only:** this prevents the loop from executing from the loopback buffer (aka the Loop Stream Detector (LSD)). Counts for `lsd.uops` are exactly 0 on HSW, and tiny on SKL (around 1.8k), and don't scale with the loop iteration count. Probably those counts are from some kernel code. When loops do run from the LSD, `lsd.uops` ~= `uops_issued` to within measurement noise. Some loops alternate between LSD and no-LSD (e.g. when they might not fit into the uop cache if decode starts in the wrong place), but I didn't run into that while testing this.
- repeated `mov ah, bh` and/or `mov ah, bl` runs at 4 per cycle. It takes an ALU uop, so it's not eliminated like `mov eax, ebx` is.
- repeated `mov ah, [rsi]` runs at 2 per cycle (load throughput bottleneck).
- repeated `mov ah, 123` runs at 1 per cycle. (A dep-breaking `xor eax,eax` inside the loop removes the bottleneck.)
- repeated `setz ah` or `setc ah` runs at 1 per cycle. (A dep-breaking `xor eax,eax` lets it bottleneck on p06 throughput for `setcc` and the loop branch.)

  Why does writing `ah` with an instruction that would normally use an ALU execution unit have a false dependency on the old value, while `mov r8, r/m8` doesn't (for reg or memory src)? (And what about `mov r/m8, r8`? Surely it doesn't matter which of the two opcodes you use for reg-reg moves?)

- repeated `add ah, 123` runs at 1 per cycle, as expected.
- repeated `add dh, cl` runs at 1 per cycle.
- repeated `add dh, dh` runs at 1 per cycle.
- repeated `add dh, ch` runs at 0.5 per cycle. Reading [ABCD]H is special when they're "clean" (in this case, RCX is not recently modified at all).
Terminology: all of these leave AH (or DH) "dirty", i.e. in need of merging (with a merging uop) when the rest of the register is read (or in some other cases). i.e. AH is renamed separately from RAX, if I'm understanding this correctly. "Clean" is the opposite. There are many ways to clean a dirty register, the simplest being `inc eax` or `mov eax, esi`.
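A sketch of the dirty-then-clean idea (an assumed example following the terminology above, not one of the measured sequences):

```asm
    mov     ah, 123     ; AH is now "dirty": renamed separately from RAX
    add     ah, dh      ; still dirty: RMW of the renamed AH
    inc     eax         ; reads RAX, so a merging uop is inserted first;
                        ; after this, AH is "clean" again
```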
**Writing to AL only:** these loops do run from the LSD: `uops_issued.any` ~= `lsd.uops`.
- repeated `mov al, bl` runs at 1 per cycle. An occasional dep-breaking `xor eax,eax` per group lets OOO execution bottleneck on uop throughput, not latency.
- repeated `mov al, [rsi]` runs at 1 per cycle, as a micro-fused ALU+load uop. (uops_issued=4G + loop overhead, uops_executed=8G + loop overhead). A dep-breaking `xor eax,eax` before a group of 4 lets it bottleneck on 2 loads per clock.
- repeated `mov al, 123` runs at 1 per cycle.
- repeated `mov al, bh` runs at 0.5 per cycle (1 per 2 cycles). Reading [ABCD]H is special.
- `xor eax,eax` + 6x `mov al,bh` + `dec ebp/jnz`: 2c per iter, bottleneck on 4 uops per clock for the front-end.
- repeated `add dl, ch` runs at 0.5 per cycle (1 per 2 cycles). Reading [ABCD]H apparently creates extra latency for `dl`.
- repeated `add dl, cl` runs at 1 per cycle.
I think a write to a low-8 reg behaves as a RMW blend into the full reg, like `add eax, 123` would be, but it doesn't trigger a merge if `ah` is dirty. So (other than ignoring `AH` merging) it behaves the same as on CPUs that don't do partial-reg renaming at all. It seems `AL` is never renamed separately from `RAX`?
- `inc al` / `inc ah` pairs can run in parallel.
- `mov ecx, eax` inserts a merging uop if `ah` is "dirty", but the actual `mov` is renamed. This is what Agner Fog describes for IvyBridge and later (see the sketch after this list).
- repeated `movzx eax, ah` runs at one per 2 cycles. (Reading high-8 registers after writing full regs has extra latency.)
- `movzx ecx, al` has zero latency and doesn't take an execution port on HSW and SKL. (Like what Agner Fog describes for IvyBridge, but he says HSW doesn't rename movzx.)
- `movzx ecx, cl` has 1c latency and takes an execution port. (mov-elimination never works for the same,same case, only between different architectural registers.)
- A loop that inserts a merging uop every iteration can't run from the LSD (loop buffer)?
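A sketch of that `mov ecx, eax` case (an assumed sequence, not one of the measured loops):

```asm
    mov     ah, bl      ; leaves AH dirty
    mov     ecx, eax    ; the front-end inserts a merging uop for RAX first;
                        ; the mov itself is still renamed (mov-elimination)
```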
I don't think there's anything special about AL/AH/RAX vs. B*, C*, DL/DH/RDX. I have tested some with partial regs in other registers (even though I'm mostly showing `AL`/`AH` for consistency), and have never noticed any difference.
How can we explain all of these observations with a sensible model of how the microarch works internally?
Related: partial-flag issues are different from partial-register issues. See INC instruction vs ADD 1: Does it matter? for some super-weird stuff with `shr r32,cl` (and even `shr r32,2` on Core2/Nehalem: don't read flags from a shift other than by 1). See also Problems with ADC/SBB and INC/DEC in tight loops on some CPUs for partial-flag stuff in `adc` loops.
Other answers welcome to address Sandybridge and IvyBridge in more detail. I don't have access to that hardware.
I haven't found any partial-reg behaviour differences between HSW and SKL. On Haswell and Skylake, everything I've tested so far supports this model:
AL is never renamed separately from RAX (or r15b from r15). So if you never touch the high8 registers (AH/BH/CH/DH), everything behaves exactly like on a CPU with no partial-reg renaming (e.g. AMD).
Write-only access to AL merges into RAX, with a dependency on RAX. For loads into AL, this is a micro-fused ALU+load uop that executes on p0156, which is one of the strongest pieces of evidence that it's truly merging on every write, and not just doing some fancy double-bookkeeping as Agner speculated.
Agner (and Intel) say Sandybridge can require a merging uop for AL, so it probably is renamed separately from RAX. Intel's optimization manual (section 3.5.2.4, Partial Register Stalls) discusses the SnB behaviour.
I think they're saying that on SnB, `add al,bl` will RMW the full RAX instead of renaming it separately, because one of the source registers is (part of) RAX. My guess is that this doesn't apply for a load like `mov al, [rbx + rax]`; `rax` in an addressing mode probably doesn't count as a source.

I haven't tested whether high8 merging uops still have to issue/rename on their own on HSW/SKL. That would make the front-end impact equivalent to 4 uops (since that's the issue/rename pipeline width).
- `xor al,al` doesn't help, and neither does `mov al, 0`.
- `movzx ebx, al` has zero latency (renamed), and needs no execution unit. (i.e. mov-elimination works on HSW and SKL.) It triggers merging of AH if it's dirty, which I guess is necessary for it to work without an ALU. It's probably not a coincidence that Intel dropped low8 renaming in the same uarch that introduced mov-elimination. (Agner Fog's micro-arch guide has a mistake here, saying that zero-extended moves are not eliminated on HSW or SKL, only IvB.)
- `movzx eax, al` is not eliminated at rename. mov-elimination on Intel never works for same,same. `mov rax,rax` isn't eliminated either, even though it doesn't have to zero-extend anything. (Although there'd be no point to giving it special hardware support, because it's just a no-op, unlike `mov eax,eax`.) Anyway, prefer moving between two separate architectural registers when zero-extending, whether it's with a 32-bit `mov` or an 8-bit `movzx`.
- `movzx eax, bx` is not eliminated at rename on HSW or SKL. It has 1c latency and uses an ALU uop. Intel's optimization manual only mentions zero-latency for 8-bit movzx (and points out that `movzx r32, high8` is never renamed).

**High-8 regs can be renamed separately from the rest of the register, and do need merging uops.**
- Writing `ah` with `mov ah, r8` or `mov ah, [mem]` does rename AH, with no dependency on the old value. These are both instructions that wouldn't normally need an ALU uop (for the 32-bit version).
- A read-modify-write of `ah` (like `inc ah`) dirties it.
- `setcc ah` depends on the old `ah`, but still dirties it. I think `mov ah, imm8` is the same, but haven't tested as many corner cases.
- (Unexplained: a loop involving `setcc ah` can sometimes run from the LSD, see the `rcr` loop at the end of this post. Maybe as long as `ah` is clean at the end of the loop, it can use the LSD?)
- If `ah` is dirty, `setcc ah` merges into the renamed `ah`, rather than forcing a merge into `rax`. e.g. `%rep 4` (`inc al` / `test ebx,ebx` / `setcc ah` / `inc al` / `inc ah`) generates no merging uops, and only runs in about 8.7c (latency of 8 `inc al` slowed down by resource conflicts from the uops for `ah`; also the `inc ah` / `setcc ah` dep chain).
- I think what's going on here is that `setcc r8` is always implemented as a read-modify-write. Intel probably decided that it wasn't worth having a write-only `setcc` uop to optimize the `setcc ah` case, since it's very rare for compiler-generated code to `setcc ah`. (But see the godbolt link in the question: clang4.0 with `-m32` will do so.)
- Reading AX, EAX, or RAX triggers a merge uop (which takes up front-end issue/rename bandwidth). Probably the RAT (Register Allocation Table) tracks the high-8-dirty state for the architectural R[ABCD]X, and even after a write to AH retires, the AH data is stored in a separate physical register from RAX. Even with 256 NOPs between writing AH and reading EAX, there is an extra merging uop. (ROB size=224 on SKL, so this guarantees that the `mov ah, 123` was retired.) Detected with uops_issued/executed perf counters, which clearly show the difference (see the sketch after this list).
- Read-modify-write of AL (e.g. `inc al`) merges for free, as part of the ALU uop. (Only tested with a few simple uops, like `add`/`inc`, not `div r8` or `mul r8`.) Again, no merging uop is triggered even if AH is dirty.
- Write-only to EAX/RAX (like `lea eax, [rsi + rcx]` or `xor eax,eax`) clears the AH-dirty state (no merging uop).
- Write-only to AX (like `mov ax, 1`) triggers a merge of AH first. I guess instead of special-casing this, it runs like any other RMW of AX/RAX. (TODO: test `mov ax, bx`, although that shouldn't be special because it's not renamed.)
- `xor ah,ah` has 1c latency, is not dep-breaking, and still needs an execution port.
- `add ah, cl` / `add al, dl` can run at 1 per clock (bottlenecked on add latency).
- Making AH dirty prevents a loop from running from the LSD (the loop-buffer), even when there are no merging uops. The LSD is when the CPU recycles uops in the queue that feeds the issue/rename stage. (Called the IDQ.)
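A sketch of the retirement test mentioned in that list (the loop framing is an assumption; the point is having more NOPs than ROB entries between the AH write and the EAX read):

```asm
.loop:
    mov     ah, 123     ; dirty AH
    times 256 nop       ; > the 224-entry SKL ROB, so the AH write has retired
    add     ecx, eax    ; reading EAX still triggers an extra merging uop
    dec     ebp
    jnz     .loop
```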
Inserting merging uops is a bit like inserting stack-sync uops for the stack-engine. Intel's optimization manual says that SnB's LSD can't run loops with mismatched `push`/`pop`, which makes sense, but it implies that it can run loops with balanced `push`/`pop`. That's not what I'm seeing on SKL: even balanced `push`/`pop` prevents running from the LSD (e.g. `push rax` / `pop rdx` / `times 6 imul rax, rdx`). (There may be a real difference between SnB's LSD and HSW/SKL: SnB may just "lock down" the uops in the IDQ instead of repeating them multiple times, so a 5-uop loop takes 2 cycles to issue instead of 1.25.) Anyway, it appears that HSW/SKL can't use the LSD when a high-8 register is dirty, or when it contains stack-engine uops.

This behaviour may be related to an erratum in SKL (SKL150).
This may also be related to Intel's optimization manual statement that SnB at least has to issue/rename an AH-merge uop in a cycle by itself. That's a weird difference for the front-end.
My Linux kernel log says `microcode: sig=0x506e3, pf=0x2, revision=0x84`. Arch Linux's `intel-ucode` package just provides the update; you have to edit config files to actually have it loaded. So my Skylake testing was on an i7-6700k with microcode revision 0x84, which doesn't include the fix for SKL150. It matches the Haswell behaviour in every case I tested, IIRC. (e.g. both Haswell and my SKL can run the `setne ah` / `add ah,ah` / `rcr ebx,1` / `mov eax,ebx` loop from the LSD). I have HT enabled (which is a pre-condition for SKL150 to manifest), but I was testing on a mostly-idle system so my thread had the core to itself.

With updated microcode, the LSD is completely disabled for everything, all the time, not just when partial registers are active. `lsd.uops` is always exactly zero, including for real programs, not just synthetic loops. Hardware bugs (rather than microcode bugs) often require disabling a whole feature to fix. This is why SKL-avx512 (SKX) is reported to not have a loopback buffer. Fortunately this is not a performance problem: SKL's increased uop-cache throughput over Broadwell can almost always keep up with issue/rename.

**Extra AH/BH/CH/DH latency:**
`add bl, ah` has a latency of 2c from input BL to output BL, so it can add latency to the critical path even if RAX and AH are not part of it. (I've seen this kind of extra latency for the other operand before, with vector latency on Skylake, where an int/float delay "pollutes" a register forever. TODO: write that up.)

This means unpacking bytes with `movzx ecx, al` / `movzx edx, ah` has extra latency vs. `movzx` / `shr eax,8` / `movzx`, but still better throughput; a sketch of both sequences follows.
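The two unpacking sequences look like this (sketch, using the registers named above):

```asm
; read AH directly: extra latency on the AH read, but fewer uops
    movzx   ecx, al
    movzx   edx, ah

; shift instead of reading AH: no high8-read penalty, but the shr costs an
; extra uop and lengthens the dep chain through EAX
    movzx   ecx, al
    shr     eax, 8
    movzx   edx, al
```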
Reading AH when it is dirty doesn't add any latency. (`add ah,ah` or `add ah,dh` / `add dh,ah` have 1c latency per add.) I haven't done a lot of testing to confirm this in many corner-cases.

Hypothesis: a dirty high8 value is stored in the bottom of a physical register. Reading a clean high8 requires a shift to extract bits [15:8], but reading a dirty high8 can just take bits [7:0] of a physical register like a normal 8-bit register read.
Extra latency doesn't mean reduced throughput. A program like the one sketched below can run at 1 iter per 2 clocks, even though all the `add` instructions have 2c latency (from reading DH, which is not modified).
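(My reconstruction, not the original program; the register choices, DH initial value, and iteration count are guesses consistent with the description.)

```asm
global _start
_start:
    mov     ebp, 100000000
    mov     edx, 7          ; RDX (and thus DH) is never written inside the loop
.loop:
    add     al, dh          ; three independent loop-carried chains; each add has
    add     bl, dh          ; 2c latency from reading the clean DH, but the chains
    add     cl, dh          ; overlap, so the loop is latency-bound at 2c/iteration
    dec     ebp
    jnz     .loop

    xor     edi, edi        ; exit(0)
    mov     eax, 231        ; __NR_exit_group on x86-64 Linux
    syscall
```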
**Some interesting test loop bodies:**
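The loop bodies themselves aren't reproduced here; this is a rough reconstruction of the core of the one discussed below, built from the `setne ah` / `add ah,ah` / `rcr ebx,1` / `mov eax,ebx` sequence named earlier. The `%if` toggle layout is my guess, and the surrounding `imul` padding that produces the 5.0c/iter figure is omitted:

```asm
.loop:
%if 1                       ; "setcc version"
    test    ebx, ebx
    setne   ah              ; setcc is RMW: depends on the old AH and dirties it
%else
    mov     ah, bl          ; write-only AH rename: breaks the AH dep chain
    ;mov    [rsp-4], ah     ; (the commented-out store/reload alternative)
    ;mov    ah, [rsp-4]
%endif
    add     ah, ah          ; keeps AH dirty
    rcr     ebx, 1          ; loop-carried through EBX and CF
    mov     eax, ebx        ; write-only to EAX: cleans the AH-dirty state
    dec     ebp
    jnz     .loop
```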
The setcc version (with the `%if 1`) has 20c loop-carried latency, and runs from the LSD even though it has `setcc ah` and `add ah,ah`.

Unexplained: it runs from the LSD, even though it makes AH dirty. (At least I think it does. TODO: try adding some instructions that do something with `eax` before the `mov eax,ebx` clears it.)

But with `mov ah, bl`, it runs in 5.0c per iteration (`imul` throughput bottleneck) on both HSW/SKL. (The commented-out store/reload works, too, but SKL has faster store-forwarding than HSW, and it's variable-latency...) Notice that it doesn't run from the LSD anymore.