A CPU reads machine code and decodes it into internal control signals that send the right data to the right execution units.
Most instructions map to one internal operation, and can be decoded directly. (e.g. on x86, add eax, edx just sends eax and edx to the integer ALU for an ADD operation, and puts the result in eax.)

Some other single instructions do much more work: e.g. x86's rep movs implements memcpy(edi, esi, ecx), and requires the CPU to loop.

When the instruction decoders see an instruction like that, instead of just producing internal control signals directly, they read micro-code out of the microcode ROM.
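The copy loop that rep movsb effectively performs can be sketched in plain C. (This is an illustrative model only; the function name rep_movsb_like is made up, and real implementations copy in chunks much wider than one byte.)

```c
#include <stddef.h>

/* Sketch of the loop rep movsb performs: copy ecx bytes from [esi]
 * to [edi], advancing both pointers.  (Assumption: simplified model;
 * the microcode sequencer issues the equivalent load/store uops
 * internally, typically in wider chunks.) */
static void rep_movsb_like(unsigned char *edi, const unsigned char *esi,
                           size_t ecx)
{
    while (ecx--) {
        *edi++ = *esi++;  /* one load uop + one store uop per byte */
    }
}
```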
A micro-coded instruction is one that decodes to many internal operations.
Modern x86 CPUs always decode x86 instructions to internal micro-operations. In this terminology, it still doesn't count as "micro-coded" even when add [mem], eax decodes to a load from [mem], an ALU ADD operation, and a store back into [mem]. Another example is xchg eax, edx, which decodes to 3 uops on Intel Haswell, and interestingly not exactly the same kind of uops you'd get from using 3 MOV instructions to do the exchange through a scratch register, because they aren't zero-latency.

On Intel / AMD CPUs, "micro-coded" means the decoders turn on the micro-code sequencer to feed uops from the ROM into the pipeline, instead of producing multiple uops directly.
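The decomposition of add [mem], eax into separate uops can be modelled in C. (A simplified illustration; mem_add is a made-up name, and the real uops also carry addressing and flag-update work not shown here.)

```c
/* C model of the three uops that add [mem], eax decodes to. */
int mem_add(int *mem, int eax)
{
    int tmp = *mem;  /* uop 1: load from [mem] */
    tmp += eax;      /* uop 2: ALU ADD */
    *mem = tmp;      /* uop 3: store back to [mem] */
    return tmp;
}
```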
In current Intel CPUs, the limit on what the decoders can produce directly, without going to micro-code ROM, is 4 uops (fused-domain). AMD similarly has FastPath single or double instructions, and beyond that it's VectorPath or Microcode, as explained in David Kanter's in-depth look at AMD Bulldozer, specifically talking about its decoders.
Another example is x86's integer DIV instruction, which is micro-coded even on modern CPUs like Intel Haswell. See my answer on Why is this C++ code faster than my hand-written assembly for testing the Collatz conjecture? for the numbers.
FP division is also slow, but is decoded to a single uop so it doesn't bottleneck the front-end. If FP division is rare and not part of a latency bottleneck, it can be as cheap as multiplication. (But if execution does have to wait for its result, or bottlenecks on its throughput, it's much slower.)
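The latency-vs-throughput distinction can be illustrated with a minimal C sketch. (The function names are hypothetical, and actual cycle counts depend on the microarchitecture; the point is only the dependency structure.)

```c
/* Latency-bound: each divide depends on the previous result, so the
 * chain pays the full divider latency on every iteration. */
double div_chained(double x, int n)
{
    for (int i = 0; i < n; i++)
        x = x / 1.000000001;
    return x;
}

/* Throughput-bound: the divides are independent, so out-of-order
 * execution can overlap them; cost is limited by divider throughput. */
double div_independent(const double *a, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += a[i] / 3.0;
    return sum;
}
```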
Integer division and other micro-coded instructions can give the CPU a hard time, and create effects that make code alignment matter where it wouldn't otherwise.
To learn more about x86 CPU internals, see the x86 tag wiki, and especially Agner Fog's microarch guide.
In some older / simpler CPUs, every instruction was effectively micro-coded. For example, the 6502 executed 6502 instructions by running a sequence of internal instructions from a PLA decode ROM. This works well for a non-pipelined CPU, where the order of using the different parts of the CPU can vary from instruction to instruction.
Historically, there was a different technical meaning for "microcode", meaning something like the internal control signals decoded from the instruction word. Especially in a CPU like MIPS where the instruction word mapped directly to those control signals, without complicated decoding. (I may have this partly wrong; I read something like this (other than in the deleted answer on this question) but couldn't find it again later.)