From open resources I gather that microcode is roughly something the CPU can execute directly, and that it is responsible for implementing machine-code instructions. Wikipedia also indicates that every execution of an instruction goes through a fetch-decode-execute instruction cycle. However, I cannot find any reference explaining how microcode execution happens during this three-phase cycle. So my question is: what's the relationship between microcode execution and the instruction cycle? How does microcode do its work during the fetch, decode and execute phases of an instruction's execution?
Also, this Stack Overflow answer says that in modern Intel CPUs even the simplest instructions like `div` and `mov` are compiled into microcode before executing, so it would be best if anyone could explain it with examples from such CPUs, if that is indeed true.
`div` is not simple, it's one of the hardest integer operations to compute! It's microcoded on Intel CPUs, unlike `mov`, or `add`/`sub`, or even `imul`, which are all single-uop on modern Intel. See https://agner.org/optimize/ for instruction tables and microarch guides. (Fun fact: AMD Ryzen doesn't microcode `div`; it's only 2 uops because it has to write 2 output registers. Piledriver and later also make 32 and 64-bit division 2 uops.)
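To make the contrast concrete, here's a NASM-style sketch with fused-domain uop counts as comments. The numbers come from Agner Fog's Skylake tables and are approximate; check the tables for your exact microarchitecture:

```
; uop counts per Agner Fog's Skylake tables (approximate; check your uarch)
mov     eax, ecx        ; 1 uop (often eliminated at register rename)
add     eax, edx        ; 1 uop
imul    eax, edx        ; 1 uop, 3-cycle latency, still not microcoded
xor     edx, edx        ; zero the high half of the dividend
div     ecx             ; ~10 uops (64-bit: ~36): > 4 uops, so microcoded from MS-ROM
```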
All instructions decode to 1 or more uops (with most instructions in most programs being 1 uop on current CPUs). Instructions which decode to 4 or fewer uops on Intel CPUs are described as "not microcoded", because they don't use the special MSROM mechanism for many-uop instructions.
No CPUs that decode x86 instructions to uops use a simple 3-phase fetch/decode/exec cycle, so that part of the premise of your question makes no sense. Again, see Agner Fog's microarch guide.
Are you sure you wanted to ask about modern Intel CPUs? Some older CPUs are internally microcoded, especially non-pipelined CPUs where the process of executing different instructions can activate different internal logic blocks in a different order. The logic that controls this is also called microcode, but it's a different kind of microcode from the modern meaning of the term in the context of a pipelined out-of-order CPU.
If that's what you're looking for, see How was microcode implemented in retro processors? on retrocomputing.SE for non-pipelined CPUs like 6502 and Z80, where some of the microcode internal timing cycles are documented.
How do microcoded instructions execute on modern Intel CPUs?
When a microcoded "indirect uop" reaches the head of the IDQ in a Sandybridge-family CPU, it takes over the issue/rename stage and feeds it uops from the microcode-sequencer MS-ROM until the instruction has issued all its uops, then the front-end can resume issuing other uops into the out-of-order back-end.
The IDQ is the Instruction Decode Queue that feeds the issue/rename stage (which sends uops from the front-end into the out-of-order back-end). It buffers uops that come from the uop cache + legacy decoders, to absorb bubbles and bursts. It's the 56 uop queue in David Kanter's Haswell block diagram. (But that shows microcode only being read before the queue, which doesn't match Intel's description of some perf events¹, or what has to happen for microcoded instructions that run a data-dependent number of uops).
(This might not be 100% accurate, but at least works as a mental model for most of the performance implications². There might be other explanations for the performance effects we've observed so far.)
This only happens for instructions that need more than 4 uops; instructions that need 4 or fewer decode to separate uops in the normal decoders and can issue normally. e.g. `xchg eax, ecx` is 3 uops on modern Intel: Why is XCHG reg, reg a 3 micro-op instruction on modern Intel architectures? goes into detail about what we can figure out about what those uops actually are.
The special "indirect" uop for a microcoded instruction takes a whole line to itself in the decoded-uop cache, the DSB (potentially causing a code-alignment performance issue). I'm not sure if they only take 1 entry in the queue that feeds the issue stage from the uop cache and/or legacy decoders, the IDQ. Anyway, I made up the term "indirect uop" to describe it. It's really more like a not-yet-decoded instruction or a pointer into the MS-ROM. (Possibly some microcoded instructions might be a couple "normal" uops and one microcode pointer; that could explain it taking a whole uop-cache line to itself.)
I'm pretty sure they don't fully expand until they reach the head of the queue, because some microcoded instructions are a variable number of uops depending on data in registers. Notably `rep movs`, which basically implements `memcpy`. In fact this is tricky; with different strategies depending on alignment and size, `rep movs` actually needs to do some conditional branching. But it's jumping to different MS-ROM locations, not to different x86 machine-code locations (RIP values). See Conditional jump instructions in MSROM procedures?.
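As a concrete illustration, `rep movsb` is architecturally a byte-granularity `memcpy` with its operands pinned to specific registers. This NASM sketch just shows the register convention (`dst_buf` and `src_buf` are placeholder symbols); the interesting part, the microcode's choice of copy strategy, is invisible at this level:

```
; rep movsb copies RCX bytes from [RSI] to [RDI], like memcpy(rdi, rsi, rcx)
    mov     rdi, dst_buf     ; destination pointer (placeholder symbol)
    mov     rsi, src_buf     ; source pointer (placeholder symbol)
    mov     rcx, 4096        ; byte count
    rep movsb                ; microcode picks a strategy based on size/alignment
```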
Intel's fast-strings patent also sheds some light on the original implementation in P6: the first `n` copy iterations are predicated in the back-end, which gives the back-end time to send the value of ECX to the MS. From that, the microcode sequencer can send exactly the right number of copy uops if more are needed, with no branching in the back-end needed. Maybe the mechanism for handling nearly-overlapping src and dst or other special cases isn't based on branching after all, but Andy Glew did mention lack of microcode branch prediction as an issue for the implementation. So we know they are special. And that was back in P6 days; `rep movsb` is more complicated now.
Depending on the instruction, it might or might not drain the out-of-order back-end's reservation station aka scheduler while sorting out what to do. `rep movs` does that for copies > 96 bytes on Skylake, unfortunately (according to my testing with perf counters, putting `rep movs` between independent chains of `imul`). This might be due to mispredicted microcode branches, which aren't like regular branches. Maybe branch-miss fast-recovery doesn't work on them, so they aren't detected / handled until they reach retirement? (See the microcode branch Q&A for more about this).
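A hedged sketch of that kind of test: two long independent `imul` dependency chains with the string instruction between them. If the back-end drains, the second chain can't execute under the shadow of the first and total cycles jump. (The `%rep` counts and buffer size here are arbitrary choices, not the exact values from my testing.)

```
; measure cycles with perf: roughly max(chain1, chain2) if they overlap,
; roughly their sum if rep movsb drains the RS. Sizes are arbitrary.
    mov     rsi, src_buf
    mov     rdi, dst_buf
%rep 40
    imul    rax, rax         ; chain 1: long latency-bound dependency chain
%endrep
    mov     rcx, 4096        ; > 96 bytes: expected to drain the RS on Skylake
    rep movsb
%rep 40
    imul    rdx, rdx         ; chain 2: independent of chain 1
%endrep
```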
`rep movs` is very different from `mov`. Normal `mov` like `mov eax, [rdi + rcx*4]` is a single uop even with a complex addressing mode. A `mov` store is 1 micro-fused uop, including both a store-address and store-data uop that can execute in either order, writing the data and physical address into the store buffer so the store can commit to L1d after the instruction retires from the out-of-order back-end and becomes non-speculative. The microcode for `rep movs` will include many load and store uops.
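In listing form, with the fused-domain uop counts from the paragraph above as comments:

```
mov     eax, [rdi + rcx*4]   ; 1 load uop, even with a complex addressing mode
mov     [rdi], eax           ; 1 micro-fused uop: store-address + store-data
```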
Footnote 1:
We know there are perf events like `idq.ms_dsb_cycles` on Skylake:

    [Cycles when uops initiated by Decode Stream Buffer (DSB) are being
    delivered to Instruction Decode Queue (IDQ) while Microcode
    Sequenser[sic] (MS) is busy]
That would make no sense if microcode is just a 3rd possible source of uops to feed into the front of the IDQ. But then there's an event whose description sounds like that:
    idq.ms_switches

    [Number of switches from DSB (Decode Stream Buffer) or MITE (legacy
    decode pipeline) to the Microcode Sequencer]
I think this actually means it counts when the issue/rename stage switches to taking uops from the microcode sequencer instead of the IDQ (which holds uops from DSB and/or MITE). Not that the IDQ switches its source of incoming uops.
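One way to poke at these counters: a loop whose body is a single microcoded instruction, run under `perf stat -e idq.ms_switches,idq.ms_uops`. This is a hedged sketch; `buf` and the iteration count are placeholders, and the expected counts are my assumption, not verified numbers.

```
; expectation (assumption): idq.ms_switches ~= iteration count,
; i.e. one MS takeover per rep stosb
    mov     ebp, 1000000     ; arbitrary iteration count
loop_top:
    mov     rdi, buf         ; buf: some writable 64-byte buffer (placeholder)
    mov     ecx, 64
    xor     eax, eax
    rep stosb                ; microcoded memset: uops come from the MS-ROM
    dec     ebp
    jnz     loop_top
```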
Footnote 2:
To test this theory, we could construct a test case with lots of easily-predicted jumps to cold i-cache lines after a microcoded instruction, and see how far the front-end gets in following cache misses and queueing up uops into the IDQ and other internal buffers during the execution of a big `rep scasb`.
SCASB doesn't have fast-strings support, so it's very slow and doesn't touch a huge amount of memory per cycle. We want it to hit in L1d so timing is highly predictable. Probably a couple of 4k pages would give the front-end enough time to follow a lot of i-cache misses. We can even map contiguous virtual pages to the same physical page (e.g. from user-space with `mmap` on a file).
If the IDQ space behind the microcoded instruction can be filled up with later instructions while it's executing, that leaves more room for the front-end to fetch from more i-cache lines ahead of when they're needed. We can then hopefully detect the difference with total cycles and/or other perf counters, for running `rep scasb` plus a sequence of jumps. Before each test, use `clflushopt` on the lines holding the jump instructions.
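A skeleton of what that test might look like (hedged: the label names and sizes are placeholders, and the `clflushopt` prep is sketched rather than tuned):

```
; flush the cache lines holding the jump chain, then time the rep scasb
    lea     rbx, [rel jump1]
    clflushopt [rbx]          ; repeat for each 64-byte line in the chain
    mfence                    ; ensure the flushes complete before timing
    mov     rdi, buf          ; a couple of 4k pages of zeros, hot in L1d
    mov     rcx, 8192
    xor     eax, eax
    repe scasb                ; slow microcoded scan: lots of front-end time
    jmp     jump1             ; easily-predicted chain of jumps...
align 64
jump1:  jmp     jump2         ; ...each landing in an (initially cold) line
align 64
jump2:  ; ... more of the same
```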
To test `rep movs` this way, we could maybe play tricks with virtual memory to get contiguous pages mapped to the same physical page, again giving us L1d hits for loads + stores, but dTLB delays would be hard to control. Or even boot with the CPU in no-fill mode, but that's very hard to use and would need a custom "kernel" to put the result somewhere visible.
I'm pretty confident we would find uops entering the IDQ while a microcoded instruction has taken over the front-end (if it wasn't already full). There is a perf event:

    idq.ms_uops

    [Uops delivered to Instruction Decode Queue (IDQ) while Microcode
    Sequenser[sic] (MS) is busy]
and 2 other events like that which count only uops coming from MITE (legacy decode) or uops coming from DSB (uop cache). Intel's description of those events is compatible with my description of how a microcoded instruction ("indirect uop") takes over the issue stage to read uops from the microcode sequencer / ROM while the rest of the front-end continues doing its thing delivering uops to the other end of the IDQ until it fills up.