The Intel Optimization Reference, under Section 3.5.1, advises:
"Favor single-micro-operation instructions."
"Avoid using complex instructions (for example, enter, leave, or loop) that have more than 4 micro-ops and require multiple cycles to decode. Use sequences of simple instructions instead."
Although Intel themselves tell compiler writers to use instructions which decode to few micro-ops, I can't find anything in any of their manuals which explains just how many micro-ops each ASM instruction decodes to! Is this information available anywhere? (Of course, I expect that the answers will be different for different generations of CPUs.)
Agner Fog's PDF document on x86 instructions (linked from the main page Hans cites) is the only reference I've found on instruction timings and micro-ops. I've never seen an Intel document on micro-op breakdown.
In addition to the resources already mentioned in the other answers (Agner Fog's tables and IACA), you can find detailed information on the μops of most x86 instructions on recent Intel CPUs (from Nehalem to Cannon Lake) on our website uops.info. The website also contains information on the latency and throughput of each instruction. The data was obtained by running automatically generated microbenchmarks both on the actual hardware (using hardware performance counters) and on top of different versions of Intel IACA.
Compared to Agner Fog's instruction tables, the data on uops.info is in several cases more accurate and precise. As an example, consider the PBLENDVB instruction on Nehalem. According to Agner Fog's tables, the instruction has one μop that can only use port 0, and one μop that can only use port 5. This is probably based on the observation that when executing the instruction repeatedly in isolation, there is, on average, one μop on port 0, and one μop on port 5. The microbenchmarks on uops.info show that actually both μops can use port 0 and port 5. This is determined by executing the instruction together with instructions that can only use port 0 or port 5.
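To make that concrete, the port-pinning idea looks roughly like this. This is a minimal sketch, not uops.info's actual harness; treating SHUFPS as a port-5-only instruction on Nehalem, and the counter names, are assumptions for illustration:

```nasm
; Interleave the instruction under test with filler uops assumed to be
; restricted to port 5. If PBLENDVB's uops were also port-5-only, the
; loop's throughput would drop; uops that can migrate to port 0 keep it fast.
benchloop:
    pblendvb xmm1, xmm2        ; instruction under test (xmm0 is the implicit mask)
    shufps   xmm3, xmm4, 0     ; filler uops that saturate port 5
    shufps   xmm5, xmm6, 0
    shufps   xmm7, xmm3, 0
    dec      ecx
    jnz      benchloop
; Per-port uop counts then come from hardware performance counters
; (e.g. the UOPS_EXECUTED port events on Nehalem).
```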
The data on uops.info also reveals several inaccuracies in Intel's IACA. For example, on Skylake both μops of the CVTPI2PS XMM, MM instruction can only use port 0 in IACA (http://uops.info/html-ports/SKL/CVTPI2PS_XMM_MM-IACA3.0.html). On the actual hardware, there is one μop that can only use port 0, and one μop that can use both port 0 and port 1. Agner Fog also observed that one μop of this instruction can use port 1; however, he claims that this μop can only use port 1, which is incorrect.
It has already been pointed out that Agner Fog's optimization manuals are an excellent resource, and in particular, his Instruction Tables, which are nearly comprehensive for all of the x86 microarchitectures of interest.
But you do have another option: Intel's Architecture Code Analyzer (IACA). There is a write-up of how to use it here on Stack Overflow, but it's pretty simple to get going (although a bit tedious for one-off analysis). You just download the executable, emit some prologue and epilogue code surrounding the block of instructions you want analyzed (a C header, iacaMarks.h, is included for this purpose and works with various compilers; alternatively, you can instruct your assembler to emit the appropriate bytes), and then run your binary through iaca.exe. The current version (v2.2) supports only 64-bit binaries, but that's not a major limitation, since the instruction-level analysis won't be substantially different for 32-bit and 64-bit modes. The current version also supports all modern Intel microarchitectures that might be of interest to a professional software developer, from Nehalem to Broadwell.
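If you take the assembler route, the markers are just a magic value in EBX followed by a distinctive three-byte sequence. A minimal NASM-style sketch, with the marker bytes as defined in iacaMarks.h and a placeholder payload instruction:

```nasm
; IACA start marker (equivalent to IACA_START from iacaMarks.h)
mov ebx, 111
db  0x64, 0x67, 0x90       ; fs addr32 nop

; ... the block of instructions to analyze goes here ...
add rax, rcx               ; illustrative payload

; IACA end marker (equivalent to IACA_END)
mov ebx, 222
db  0x64, 0x67, 0x90
```

Assemble and link as usual, then point iaca.exe at the resulting binary, selecting the target microarchitecture with the -arch option.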
The output you get from this tool will tell you which ports a particular instruction can execute on, as well as how many µops that instruction will decompose to, on the specified microarchitecture.
That is as close as you're going to get to a direct answer to your question, since as Hans Passant pointed out in the comments, the exact µops that each instruction decomposes to are intentionally kept secret by Intel. Not only are they a proprietary trade secret, but Intel wants to be free to change how it works from one microarchitecture to another. In fact, though, how many µops an instruction decomposes to is all you would ever want to know when optimizing code. It doesn't matter which µops the instruction decomposes to.
But I would reiterate one portion of Peter Cordes's answer: "It's easy to guess in some cases, though". If you have to look up this type of detailed information for each instruction you're considering, you're going to waste a lot of time. You're also going to drive yourself mad, since, as you already know, it varies from one microarchitecture to another. The real trick here is getting an intuitive feel for which instructions in the x86 ISA are "simple" and which are "complex". It should be pretty obvious from reading the documentation, and that intuitive feeling is really all that Intel's optimization recommendations are driving you towards. Avoid "complex" (old CISC-style) instructions like LOOP, ENTER, LEAVE, and so forth. For example, prefer DEC+JNZ over LOOP (see the sketch below). Relatively speaking, only a small minority of "classic" x86 instructions decode to more than one or two µops.* Studying the output of a good optimizing compiler will also lead you in the right direction, since you'll never see compilers use these "complex" instructions.
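To illustrate with a minimal sketch (the loop bodies are placeholders):

```nasm
    mov  ecx, 100
slowloop:
    ; ... loop body ...
    loop slowloop         ; complex, multi-uop: avoid

    mov  ecx, 100
fastloop:
    ; ... loop body ...
    dec  ecx
    jnz  fastloop         ; simple uops; DEC+JNZ can even macro-fuse
                          ; into a single uop on Sandy Bridge and later
```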
Somewhat contra Peter's answer, though, I'm pretty sure that the quoted section of Intel's optimization manuals is not referring to the SIMD instructions. It's talking about the old-school CISC instructions that are implemented in microcode and that Intel would have dropped already if they didn't have to support them for backwards compatibility. If you need the behavior of SSE3's HADDPS, then you are probably better off using HADDPS instead of trying to break it down into "simpler" components. (Unless, of course, you can better schedule those operations by interleaving them within unrelated code. But that's awfully hard to do in practice.)
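As a small illustration, the classic horizontal-sum idiom leans on HADDPS rather than decomposing it:

```nasm
; Horizontal sum of the four packed floats in xmm0 = [a, b, c, d].
haddps xmm0, xmm0      ; xmm0 = [a+b, c+d, a+b, c+d]
haddps xmm0, xmm0      ; xmm0 = [a+b+c+d, ...]
```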
*To be completely accurate, there are certain seemingly-simple instructions that are actually implemented using microcode and decompose to multiple µops. A 64-bit division (DIV) is an example. If I remember correctly, this is microcoded using something like 30–40 µops (variable). However, this is not an instruction that you should avoid, which shows that Intel's manuals are being very general in dispensing advice here. If you need to do a division, use a DIV. Obviously prefer not doing divisions when optimizing for speed, but don't try to write your own division algorithm just to avoid the microcoded DIV, either!
The other big exception here is the string instructions. The performance calculus for those is a bit more complicated than "avoid because they decode to multiple µops", though.
Fortunately, one thing is simple: never use the string instructions without a REP prefix. That just doesn't make sense, and you will get significantly better performance by "decomposing" the instruction into the simpler "component" instructions yourself; for example, MOVSB → MOV AL, [ESI] + MOV ES:[EDI], AL + INC/DEC ESI + INC/DEC EDI.
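Spelled out as a sketch (assuming the direction flag is clear, i.e. a forward copy):

```nasm
; Equivalent of a single MOVSB with DF = 0:
mov  al, [esi]        ; load byte from DS:ESI
mov  [es:edi], al     ; store byte to ES:EDI
inc  esi
inc  edi
```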
Where it gets a bit trickier to decide is when you start taking advantage of the REP prefix. Although this does cause the instruction to decode into many µops, it is sometimes still more efficient to use the repeated string instructions than to code the loop manually yourself. But not always. There's been lots of discussion of this issue already on Stack Overflow and elsewhere; for example, see this question.
A detailed analysis is really beyond the scope of this answer, but my quick rule of thumb is that you can forget about REP LODS, REP SCAS, and REP CMPS entirely. On the other hand, REP MOVS and REP STOS are useful when you need to repeat a reasonably large number of times. Always use the largest word size possible: DWORD on 32-bit, QWORD on 64-bit (but note that on modern processors, you may be better off using MOVSB/STOSB, since they can move larger quantities internally). And even when all these conditions are met, if your target has vector instructions available, you probably want to verify that it wouldn't be faster to implement the move/store with vector moves. A sketch of the basic pattern follows.
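A minimal 64-bit sketch of that rule of thumb (the count arriving in RDX is an assumption for illustration):

```nasm
; Copy rdx quadwords from [rsi] to [rdi], forward (DF = 0).
    mov  rcx, rdx
    rep  movsq            ; QWORD-sized moves on 64-bit
; On CPUs with enhanced REP MOVSB (ERMSB), a plain byte count with
; REP MOVSB may be just as fast or faster, per the caveat above:
;   mov  rcx, rdx
;   shl  rcx, 3           ; byte count
;   rep  movsb
```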
See also Agner Fog's general advice on page 150.
Agner Fog's instruction tables show which ports micro-ops run on, which is all that matters for performance. They don't show exactly what each uop does (i.e., which execution unit it uses on that port), because that's not something you can reverse-engineer.
It's easy to guess in some cases, though: haddps on Haswell is 1 uop for port 1, and 2 uops for port 5. That's pretty obviously 2 shuffles (port 5) and an FP add (port 1). There are lots of other execution units on port 5, e.g. vector boolean, SIMD integer add, and lots of scalar integer stuff, but given that haddps needs multiple uops at all, it's pretty obvious that Intel implements it with shuffles and a regular "vertical" add uop.
It might be possible to figure out something about the dependency relationship between those uops (e.g. is it 2 shufps-style shuffles feeding an FP add, or is it shuffle-add-shuffle?). We also aren't sure whether the shuffles are independent of each other or not: Haswell only has one shuffle port, so the resource conflict would give us 5c total latency because the shuffles couldn't run in parallel even if they were independent.
Both shuffle uops probably need both inputs, so even if they're independent of each other, having one input ready sooner than the other doesn't improve the latency for the critical-path (from the slower input to the output).
If it were possible to implement HADDPS with 2 independent one-input shuffles, that would mean that HADDPS xmm0, xmm1 in a loop where xmm1 was a constant would only add 4c of latency to the dep chain involving xmm0. I haven't measured, but I think it's unlikely; almost certainly it's two independent 2-input shuffles feeding an ADDPS uop.
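For illustration, here is what that breakdown might look like if written out as architectural instructions. This is a plausible sketch, not a confirmed decomposition; the extra MOVAPS only exists because architectural instructions destroy a source operand, which the internal uops wouldn't need to do:

```nasm
; Plausible (unconfirmed) breakdown of: haddps xmm0, xmm1
; with xmm0 = [a0, a1, a2, a3] and xmm1 = [b0, b1, b2, b3].
movaps  xmm2, xmm0
shufps  xmm2, xmm1, 0x88   ; even elements:  [a0, a2, b0, b2]
shufps  xmm0, xmm1, 0xDD   ; odd elements:   [a1, a3, b1, b3]
addps   xmm0, xmm2         ; vertical add => [a0+a1, a2+a3, b0+b1, b2+b3]
```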