I have found something unexpected (to me) using the Intel® Architecture Code Analyzer (IACA).
The following instruction, using [base+index] addressing:

addps xmm1, xmmword ptr [rsi+rax*1]

does not micro-fuse according to IACA. However, if I use [base+offset] addressing like this:

addps xmm1, xmmword ptr [rsi]

IACA reports that it does fuse.
Section 2-11 of the Intel optimization reference manual gives the following as an example "of micro-fused micro-ops that can be handled by all decoders"
FADD DOUBLE PTR [RDI + RSI*8]
and Agner Fog's optimization assembly manual also gives examples of micro-op fusion using [base+index]
addressing. See, for example, Section 12.2 "Same example on Core2". So what's the correct answer?
I have now reviewed test results for Intel Sandy Bridge, Ivy Bridge, Haswell and Broadwell. I have not had access to test on a Skylake yet. The results are:
Your results may be due to other factors. I have not tried to use the IACA.
In the decoders and uop-cache, addressing mode doesn't affect micro-fusion (except that an instruction with an immediate operand can't micro-fuse a RIP-relative addressing mode).
But some combinations of uop and addressing mode can't stay micro-fused in the ROB (in the out-of-order core), so Intel SnB-family CPUs "un-laminate" when necessary, at some point before the issue/rename stage. For issue-throughput, and out-of-order window size (ROB-size), fused-domain uop count after un-lamination is what matters.
Intel's optimization manual describes un-lamination for Sandybridge in Section 2.3.2.4: Micro-op Queue and the Loop Stream Detector (LSD), but doesn't describe the changes for any later microarchitectures.
The rules, as best I can tell from experiments on SnB, HSW, and SKL:

SnB (and IvB): indexed addressing modes are always un-laminated; other addressing modes stay micro-fused.

HSW and SKL: these only keep an indexed ALU instruction micro-fused if it has 2 operands and treats the dst register as read-modify-write. "Operands" here includes flags, meaning that adc and cmov don't micro-fuse. Most VEX-encoded instructions also don't fuse, since they generally have three operands (so paddb xmm0, [rdi+rbx] fuses but vpaddb xmm0, xmm0, [rdi+rbx] doesn't). Finally, the occasional 2-operand instruction where the first operand is write-only, such as pabsb xmm0, [rax + rbx], also does not fuse. IACA is wrong, applying the SnB rules.

Related: simple (non-indexed) addressing modes are the only ones that the dedicated store-address unit on port7 (Haswell and later) can handle, so it's still potentially useful to avoid indexed addressing modes for stores. (A good trick for this is to address your dst with a single register, but the src with dst+(initial_src-initial_dst); then you only have to increment the dst register inside the loop.)
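For illustration, a minimal copy loop using that trick might look like this (a sketch, not from the original post; it assumes rsi was pre-computed as initial_src - initial_dst, and rdx points one past the end of dst):

copy_loop:
    movups xmm0, [rdi + rsi]   ; src = dst + (initial_src - initial_dst); indexed is fine for loads
    movups [rdi], xmm0         ; store address uses a single register, so it's eligible for the port7 AGU
    add    rdi, 16             ; only the dst pointer needs incrementing
    cmp    rdi, rdx
    jb     copy_loop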
Note that some instructions never micro-fuse at all (even in the decoders / uop-cache). e.g. shufps xmm, [mem], imm8 or vinsertf128 ymm, ymm, [mem], imm8 are always 2 uops on SnB through Skylake, even though their register-source versions are only 1 uop. This is typical for instructions with an imm8 control operand plus the usual dest/src1, src2 register/memory operands, but there are a few other cases. e.g. PSRLW/D/Q xmm, [mem] (vector shift count from a memory operand) doesn't micro-fuse, and neither does PMULLD.

See also this post on Agner Fog's blog for discussion about issue throughput limits on HSW/SKL when you read lots of registers: lots of micro-fusion with indexed addressing modes can lead to slowdowns vs. the same instructions with fewer register operands (one-register addressing modes and immediates). We don't know the cause yet, but I suspect some kind of register-read limit, maybe related to reading lots of cold registers from the PRF.
Test cases, numbers from real measurements: These all micro-fuse in the decoders, AFAIK, even if they're later un-laminated.
Three-input instructions that HSW/SKL may have to un-laminate
I assume that Broadwell behaves like Skylake for adc/cmov.
It's strange that HSW un-laminates memory-source ADC and CMOV. Maybe Intel didn't get around to changing that from SnB before they hit the deadline for shipping Haswell.
Agner's insn table says cmovcc r,m and adc r,m don't micro-fuse at all on HSW/SKL, but that doesn't match my experiments. The cycle counts I'm measuring match up with the fused-domain uop issue count, for a 4 uops / clock issue bottleneck. Hopefully he'll double-check that and correct the tables.

Memory-dest integer ALU:
Yes, that's right: adc [rdi],eax / dec ecx / jnz runs faster than the same loop with add instead of adc on SKL. I didn't try using different addresses, since clearly SKL doesn't like repeated rewrites of the same address (store-forwarding latency higher than expected). See also this post about repeated store/reload to the same address being slower than expected on SKL.
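That loop's shape, written out (a sketch):

.loop:
    adc [rdi], eax    ; memory-destination adc: load + ALU + store, plus an extra uop (see below)
    dec ecx
    jnz .loop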
Memory-destination adc is so many uops because Intel P6-family (and apparently SnB-family) can't keep the same TLB entries for all the uops of a multi-uop instruction, so it needs an extra uop to work around the problem case where the load and add complete, and then the store faults, but the insn can't just be restarted because CF has already been updated. Interesting series of comments from Andy Glew (@krazyglew).

Presumably fusion in the decoders and un-lamination later save us from needing microcode ROM to produce more than 4 fused-domain uops from a single instruction for adc [base+idx], reg.

Why SnB-family un-laminates:
Sandybridge simplified the internal uop format to save power and transistors (along with making the major change to using a physical register file, instead of keeping input / output data in the ROB). SnB-family CPUs only allow a limited number of input registers for a fused-domain uop in the out-of-order core. For SnB/IvB, that limit is 2 inputs (including flags). For HSW and later, the limit is 3 inputs for a uop. I'm not sure if memory-destination add and adc are taking full advantage of that, or if Intel had to get Haswell out the door with some instructions still un-laminating.

Nehalem and earlier have a limit of 2 inputs for an unfused-domain uop, but the ROB can apparently track micro-fused uops with 3 input registers (the non-memory register operand, base, and index).
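Counting inputs for a couple of cases makes those limits concrete (a worked example based on the limits above, not from the original post):

add eax, [rsi+rdi]   ; inputs: eax, rsi, rdi = 3      -> un-laminated on SnB/IvB (2-input limit), stays fused on HSW+ (3-input limit)
adc eax, [rsi+rdi]   ; inputs: eax, rsi, rdi, CF = 4  -> un-laminated even on HSW/SKL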
So indexed stores and ALU+load instructions can still decode efficiently (not having to be the first uop in a group), and don't take extra space in the uop cache, but otherwise the advantages of micro-fusion are essentially gone for tuning tight loops. "Un-lamination" happens before the out-of-order core's 4-fused-domain-uops-per-cycle issue/retire width. The fused-domain performance counters (uops_issued / uops_retired.retire_slots) count fused-domain uops after un-lamination.
Intel's description of the renamer (Section 2.3.3.1: Renamer) implies that it's the issue/rename stage which actually does the un-lamination, so uops destined for un-lamination may still be micro-fused in the 28/56/64 fused-domain uop issue queue / loop-buffer (aka the IDQ).
TODO: test this. Make a loop that should just barely fit in the loop buffer. Change something so one of the uops will be un-laminated before issuing, and see if it still runs from the loop buffer (LSD), or if all the uops are now re-fetched from the uop cache (DSB). There are perf counters to track where uops come from, so this should be easy.
Harder TODO: if un-lamination happens between reading from the uop cache and adding to the IDQ, test whether it can ever reduce uop-cache bandwidth. Or if un-lamination happens right at the issue stage, can it hurt issue throughput? (i.e. how does it handle the leftover uops after issuing the first 4?)
(See a previous version of this answer for some guesses based on tuning some LUT code, with some notes on vpgatherdd being about 1.7x more cycles than a pinsrw loop.)

Experimental testing on SnB

The HSW/SKL numbers were measured on an i5-4210U and an i7-6700k. Both had HT enabled (but the system was idle, so the thread had the whole core to itself). I ran the same static binaries on both systems, Linux 4.10 on SKL and Linux 4.8 on HSW, using ocperf.py. (The HSW laptop NFS-mounted my SKL desktop's /home.)

The SnB numbers were measured as described below, on an i5-2500k which is no longer working.
Confirmed by testing with performance counters for uops and cycles.
I found a table of PMU events for Intel Sandybridge, for use with Linux's perf command. (Standard perf unfortunately doesn't have symbolic names for most hardware-specific PMU events, like uops.) I made use of it for a recent answer. ocperf.py provides symbolic names for these uarch-specific PMU events, so you don't have to look up tables. Also, the same symbolic name works across multiple uarches. I wasn't aware of it when I first wrote this answer.

To test for uop micro-fusion, I constructed a test program that is bottlenecked on the 4-uops-per-cycle fused-domain limit of Intel CPUs. To avoid any execution-port contention, many of these uops are nops, which still sit in the uop cache and go through the pipeline the same as any other uop, except they don't get dispatched to an execution port. (An xor x, same, or an eliminated move, would be the same.)

Test program:

yasm -f elf64 uop-test.s && ld uop-test.o -o uop-test
I also found that the uop bandwidth out of the loop buffer isn't a constant 4 per cycle if the loop isn't a multiple of 4 uops. (i.e. it's abc, abc, ...; not abca, bcab, ...). Agner Fog's microarch doc unfortunately wasn't clear on this limitation of the loop buffer. See Is performance reduced when executing loops whose uop count is not a multiple of processor width? for more investigation on HSW/SKL. SnB may be worse than HSW in this case, but I'm not sure and no longer have working SnB hardware.

I wanted to keep macro-fusion (compare-and-branch) out of the picture, so I used nops between the dec and the branch. I used 4 nops, so with micro-fusion the loop would be 8 uops, and fill the pipeline at 2 cycles per iteration.

In the other version of the loop, using 2-register addressing modes that don't micro-fuse, the loop will be 10 fused-domain uops, and run in 3 cycles.
Results from my 3.3GHz Intel Sandybridge (i5 2500k). I didn't do anything to get the cpufreq governor to ramp up clock speed before testing, because cycles are cycles when you aren't interacting with memory. I've added annotations for the performance counter events that I had to enter in hex.
testing the 1-reg addressing mode: no cmdline arg
testing the 2-reg addressing mode: with a cmdline arg
So, both versions ran 80M instructions, and dispatched 60M uops to execution ports. (or with a memory source dispatches to an ALU for the or, and a load port for the load, regardless of whether it was micro-fused or not in the rest of the pipeline. nop doesn't dispatch to an execution port at all.) Similarly, both versions retire 100M unfused-domain uops, because the 40M nops count here.

The difference is in the counters for the fused domain.
I suspect that you'd only see a difference between UOPS_ISSUED and UOPS_RETIRED(retirement slots used) if branch mispredicts led to uops being cancelled after issue, but before retirement.
And finally, the performance impact is real. The non-fused version took 1.5x as many clock cycles. This exaggerates the performance difference compared to most real cases. The loop has to run in a whole number of cycles, and the 2 extra uops push it from 2 to 3. Often, an extra 2 fused-domain uops will make less difference. And potentially no difference, if the code is bottlenecked by something other than 4 fused-domain uops per cycle.
Still, code that makes a lot of memory references in a loop might be faster if implemented with a moderate amount of unrolling and incrementing multiple pointers which are used with simple [base + immediate offset] addressing, instead of using [base + index] addressing modes, as in the sketch below.
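For illustration (a sketch, not from the original post; register assignments and the end pointer in rcx are arbitrary):

.loop:
    movaps xmm0, [rsi]        ; each stream gets its own pointer; simple addressing micro-fuses and stays fused
    addps  xmm0, [rdx]
    movaps [rdi], xmm0        ; one-register store address: eligible for the port7 AGU
    movaps xmm1, [rsi+16]     ; unrolled 2x using [base + immediate offset]
    addps  xmm1, [rdx+16]
    movaps [rdi+16], xmm1
    add rsi, 32               ; one increment per pointer per iteration
    add rdx, 32
    add rdi, 32
    cmp rdi, rcx              ; rcx = end of dst (assumption)
    jb  .loop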
Further stuff:
RIP-relative with an immediate can't micro-fuse. Agner Fog's testing shows that this is the case even in the decoders / uop-cache, so they never fuse in the first place (rather than being un-laminated).
IACA gets this wrong, and claims that both of these micro-fuse:
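Something like this pair shows the difference (a sketch; mydata is a placeholder symbol):

cmp dword [abs mydata], 0x1b   ; 32-bit absolute address + immediate: can micro-fuse
cmp dword [rel mydata], 0x1b   ; RIP-relative + immediate: never micro-fuses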
RIP-rel does micro-fuse (and stay fused) when there's no immediate, e.g.:
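For instance (again illustrative, with mydata as a placeholder):

or  eax, dword [rel mydata]    ; no immediate operand, so the RIP-relative load micro-fuses and stays fused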
Micro-fusion doesn't increase the latency of an instruction. The load can issue before the other input is ready.
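The kind of loop being discussed looks something like this (a sketch under assumptions, not the exact loop measured: a chain of dependent ors through eax, each with an independent micro-fused load):

.loop:
%rep 5
    or  eax, [rsi + rdi]   ; loop-carried dep chain runs through eax; the loads stay off the critical path
%endrep
    dec rdi
    jge .loop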
This loop runs at 5 cycles per iteration, because of the eax dep chain. No faster than a sequence of or eax, [rsi + 0 + rdi], or a mov ebx, [rsi + 0 + rdi] / or eax, ebx pair. (The unfused and the mov versions both run the same number of uops.) Scheduling / dep checking happens in the unfused domain. Newly issued uops go into the scheduler (aka Reservation Station, RS) as well as the ROB. They leave the scheduler after dispatching (aka being sent to an execution unit), but stay in the ROB until retirement. So the out-of-order window for hiding load latency is at least the scheduler size (54 unfused-domain uops in Sandybridge, 60 in Haswell, 97 in Skylake).

Micro-fusion doesn't have a shortcut for the base and offset being the same register. A loop with or eax, [mydata + rdi+4*rdi] (where rdi is zeroed) runs as many uops and cycles as the loop with or eax, [rsi+rdi]. This addressing mode could be used for iterating over an array of odd-sized structs starting at a fixed address. It's probably never used in most programs, so it's no surprise that Intel didn't spend transistors on allowing this special case of 2-register modes to micro-fuse. (And Intel documents it as "indexed addressing modes" anyway, where a register and scale factor are needed.)
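For example, iterating over 5-byte structs with that mode might look like this (illustrative; mydata is a placeholder symbol and rcx holds the element count):

.loop:
    movzx eax, byte [mydata + rdi + rdi*4]   ; element i lives at mydata + 5*i
    inc rdi
    cmp rdi, rcx
    jb  .loop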
Macro-fusion of a cmp/jcc or dec/jcc creates a uop that stays as a single uop even in the unfused domain. dec / nop / jge can still run in a single cycle, but is three uops instead of one.

Older Intel processors without a uop cache can do the fusion, so maybe this is a drawback of the uop cache. I don't have the time to test this right now, but I will add a test for uop fusion next time I update my test scripts. Have you tried with FMA instructions? They are the only instructions that allow 3 input dependencies in an unfused uop.
Note: Since I wrote this answer, Peter tested Haswell and Skylake as well and integrated the results into the accepted answer above (in particular, most of the improvements I attribute to Skylake below seem to have actually appeared in Haswell). You should see that answer for the rundown of behavior across CPUs and this answer (although not wrong) is mostly of historical interest.
My testing indicates that on Skylake at least [1], the processor fully fuses even complex addressing modes, unlike Sandybridge.
That is, the 1-arg and 2-arg versions of the code posted above by Peter run in the same number of cycles, with the same number of uops dispatched and retired.
My results:

Performance counter stats for ./uop-test:

Performance counter stats for ./uop-test x:

Performance counter stats for ./uop-test x x:

I didn't find any UOPS_RETIRED_ANY event on Skylake, only the "retired slots" one, which is apparently fused-domain.
The final test (uop-test x x) is a variant that Peter suggested, using a RIP-relative cmp with an immediate, which is known not to micro-fuse. The results show that the extra 2 uops per cycle are picked up by the uops-issued and uops-retired counters (hence the test can differentiate between fusion occurring and not).
More tests on other architectures are welcome! You can find the code (copied from Peter above) on GitHub.
[1] ... and perhaps some other architectures in between Skylake and Sandybridge, since Peter only tested SB and I only tested SKL.