x86 Program Counter abstracted from microarchitecture?

Posted 2020-08-13 07:41

Question:

I'm reading the book The RISC-V Reader: An Open Architecture Atlas. The authors, to explain the isolation of an ISA (Instruction Set Architecture) from a particular implementation (i.e., microarchitecture) wrote:

The temptation for an architect is to include instructions in an ISA that helps performance or cost of one implementation at a particular time, but burden different or future implementations.

As far as I understand, it states that when designing an ISA, the ISA should ideally refrain from exposing the details of a particular microarchitecture that implements it.


Keeping the quote above in mind: When it comes to the program counter, on the RISC-V ISA, the program counter (pc) points to the instruction being currently executed. On the other hand, on the x86 ISA, the program counter (eip) does not contain the address of the instruction being currently executed, but the address of the one following the current instruction.

Is the x86 Program Counter abstracted away from the microarchitecture?

Answer 1:

I'm going to answer this in terms of MIPS instead of x86, because (1) MIPS and x86 are similar in this area, and (2) RISC V was developed by Patterson et al. after decades of experience with MIPS.  I feel these statements from their books are best understood through this comparison, because x86 and MIPS both encode branch offsets relative to the end of the instruction (pc+4 in MIPS).

In both MIPS and x86, PC-relative addressing modes were only found in branches in early ISA versions. Later revisions added PC-relative address calculation (e.g. MIPS auipc or x86-64's RIP-relative addressing mode for LEA or load/store). These are all consistent with each other: the offset is encoded relative to (one past) the end of the instruction, i.e. the start of the next instruction. In RISC V, as you're noting, the encoded offset for branches (and for auipc, etc.) is instead relative to the start of the instruction.

The value of the RISC V approach is that it removes an adder from certain datapaths. Sometimes one of these datapaths is on the critical path, so for some implementations this minor shortening of the datapath means a higher clock rate.

(RISC V, of course, still has to produce instruction + 4 for pc-next and the return address of call instructions, but that is much less on the critical path.  Note that in the diagrams below neither shows the capture of pc+4 as a return address.)
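To make the two conventions concrete, here is a minimal sketch of the target calculation in each style. The function names are hypothetical, and the offsets are plain byte offsets for simplicity (real encodings scale or pack the immediate differently), so treat it as a model of the arithmetic rather than of any actual datapath.

```python
# Hypothetical helpers modelling the two offset conventions (not real hardware).

def target_end_relative(instr_addr: int, instr_len: int, offset: int) -> int:
    """x86 / MIPS style: the offset is applied to the next instruction's address."""
    next_addr = instr_addr + instr_len  # this add must happen before the target add
    return next_addr + offset


def target_start_relative(instr_addr: int, offset: int) -> int:
    """RISC V style: the offset is applied directly to the instruction's own address."""
    return instr_addr + offset


# A 4-byte branch at 0x1000 that should land on 0x1020 (addresses are made up):
end_relative_offset = 0x1020 - (0x1000 + 4)   # what an end-relative ISA would encode
start_relative_offset = 0x1020 - 0x1000       # what a start-relative ISA would encode

assert target_end_relative(0x1000, 4, end_relative_offset) == 0x1020
assert target_start_relative(0x1000, start_relative_offset) == 0x1020
```

Both conventions reach the same target; they differ only in which value feeds the target adder, which is where the extra adder in the path comes from.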


Let's compare hardware block diagrams:

[Figure: MIPS datapath (simplified)]

[Figure: RISC V datapath (simplified)]

On the RISC V datapath diagram, the line tagged #5 (in red, just above the control oval) bypasses the adder (#4, which adds 4 to the pc for pc-next).


Attribution for diagrams

  • MIPS: Need help in adding functionality to MIPS single cycle datapath?
  • RISC V: https://www.codementor.io/erikeidt/logic-block-diagrams-w6zxr6sp6

Why did x86 / MIPS make that different choice back in their initial versions?

Of course, I can't say for sure.  What it looks like to me is that there was a choice to be made and it simply didn't matter for the earliest implementations, so they probably were not even aware of the potential issue.  Almost every instruction needs to compute instruction-next anyway, so this probably seemed like the logical choice.

At best, they might have saved a few wires, as pc-next is indeed required by other instructions (e.g. call) and pc+0 is not necessarily otherwise needed.

An examination of prior processors might show this was just the way things were done back then, so this might have been more of a carry-over of existing methods than a deliberate design choice.

8086 is not pipelined (other than the instruction prefetch buffer) and variable-length decoding has already found the end of an instruction before it starts to execute.

With years of hindsight, this datapath issue is now addressed in RISC V.

I doubt they made the same level of conscious decision about this as was made, for example, for branch delay slots (MIPS).


As per discussion in comments, 8086 may not have had any exceptions that push the instruction start address. Unlike on later x86 models, divide exceptions pushed the address of the instruction after div/idiv. And in 8086, interrupt-resume after cs rep movsb (or other string instruction) pushed the address of the last prefix, not the whole instruction including multiple prefixes. This "bug" is documented in Intel's 8086 manual (scanned PDF). So it's quite possible 8086 really didn't record the instruction start address or length, only the address where decoding finished before starting execution. This was fixed by at least 286, maybe 186, but applies to all 8086 / 8088 CPUs.

MIPS had virtual memory from the start, so it did need to be able to record the address of a faulting instruction so it could be rerun after exception-return. Software TLB-miss handling also required rerunning a faulting instruction. But exceptions are slow and flush the pipeline anyway, and aren't detected until well after fetch, so presumably some calculation would be needed regardless.



Answer 2:

As far as I understand, it states that when designing an ISA, the ISA should ideally refrain from exposing the details of a particular microarchitecture that implements it.

If your metric for an ideal ISA is simplicity, then I might agree with you. But in some cases, it can be beneficial to expose some characteristics of the microarchitecture through the ISA to improve performance, and there are ways to make the burden of doing that negligible. Consider, for example, the software prefetch instructions in x86. The behavior of these instructions is architecturally defined to be microarchitecture-dependent. Intel can even design a microarchitecture in the future where these instructions behave as no-ops, without violating the x86 spec. The only burden there is defining the functionality of these instructions (1). However, if a prefetch instruction were architecturally defined to prefetch a 64-byte-aligned block of data into the L3 cache, with no CPUID bit to make support for the instruction optional, then supporting it could indeed become a substantial burden in the future.

Is the x86 Program Counter abstracted away from the microarchitecture?

Before it was edited by @InstructionPointer, your question referred to the "first implementation" of x86, which is the 8086. This is a simple processor with two pipe stages: fetch and execute. One of the architectural registers is IP, which is defined to contain the 16-bit offset (from the code segment base) of the next instruction. So the architectural value of IP at every instruction is equal to the offset of that instruction plus its size.

How is this implemented in the 8086? There is actually no physical register that stores the architectural IP value. There is a single physical instruction pointer register, but it points to the next 16 bits to be fetched into the instruction queue, which can hold up to 6 bytes (see: https://patents.google.com/patent/US4449184A/en). If the instruction currently being executed is a control transfer instruction, the target address is calculated on the fly from the relative offset in the instruction, the current value in the physical IP, and the number of valid bytes in the instruction queue. For example, if the relative offset is 15, the physical IP is 100, and the instruction queue contains 4 valid bytes, then the target offset is: 100 - 4 + 15 = 111. The physical address can then be calculated by adding the 20-bit code segment address.

Clearly, the architectural IP does not expose any of these microarchitectural details. In modern Intel processors, there can be many instructions in flight, so each instruction needs to carry with it enough information to reconstruct its address or the address of the following instruction.
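The 8086 calculation described above can be sketched as follows. This is only a model of the arithmetic as I've described it, not of the actual circuitry; the function names and the CS value 0x1000 are assumptions chosen for illustration.

```python
# Model of the arithmetic only; names and the CS value are illustrative assumptions.

def architectural_ip(physical_ip: int, valid_queue_bytes: int) -> int:
    """Offset of the next instruction, recovered from the prefetch pointer."""
    return (physical_ip - valid_queue_bytes) & 0xFFFF  # IP is a 16-bit offset


def branch_target(physical_ip: int, valid_queue_bytes: int, rel_offset: int) -> int:
    """Target offset of a relative control transfer."""
    return (architectural_ip(physical_ip, valid_queue_bytes) + rel_offset) & 0xFFFF


def physical_address(cs: int, offset: int) -> int:
    """20-bit address: the segment value shifted left by 4, plus the offset."""
    return ((cs << 4) + offset) & 0xFFFFF


# The example from the text: physical IP 100, 4 valid queue bytes, offset 15.
target = branch_target(100, 4, 15)
assert target == 111

# With an assumed CS of 0x1000, the segment base is 0x10000.
assert physical_address(0x1000, target) == 0x1006F
```

Software never sees `physical_ip` or the queue occupancy; only the reconstructed architectural value is visible.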

What if the x86 architectural IP was defined to point to the current instruction instead of the next instruction? How would this impact the design of the 8086? Well, the relative offset in a control transfer instruction becomes relative to the offset of the current instruction, not the next one. In the previous example, we would have to subtract the length of the current instruction from 111 to get the target offset. So there may be a need for additional hardware to track the size of the current instruction and include it in the calculation. But in such an ISA, we can define all control transfer instructions to have a uniform length (2) (other instructions can still be of variable length), which eliminates most of that overhead. I can't think of a realistic example where defining the program counter one way is significantly better than the other. However, it may influence the design of the ISA.
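As a rough illustration of the difference, here are the two definitions side by side, using the same numbers as the earlier example (physical IP 100, 4 valid queue bytes, offset 15). The 2-byte instruction length and the function names are assumptions made up for this sketch.

```python
# Sketch of the arithmetic only; the 2-byte jump length is an assumed value.

def target_next_relative(physical_ip: int, queue_bytes: int, rel_offset: int) -> int:
    """Actual x86 definition: the offset is relative to the next instruction."""
    return (physical_ip - queue_bytes + rel_offset) & 0xFFFF


def target_current_relative(physical_ip: int, queue_bytes: int,
                            instr_len: int, rel_offset: int) -> int:
    """Hypothetical definition: the offset is relative to the current instruction."""
    current = physical_ip - queue_bytes - instr_len  # extra term: instruction length
    return (current + rel_offset) & 0xFFFF


assert target_next_relative(100, 4, 15) == 111
assert target_current_relative(100, 4, 2, 15) == 109  # 111 minus the length

# If the length were kept in a small register, a maximum of 6 bytes
# (the figure recalled in footnote 2) would need 3 bits.
LENGTH_REGISTER_BITS = (6).bit_length()
assert LENGTH_REGISTER_BITS == 3
```

The only extra input in the hypothetical version is `instr_len`, which is why a uniform length for control transfer instructions, or a tiny length register, would cover most of the cost.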


Footnotes:

(1) The decoders may still have to be able to recognize that the prefetch instructions are valid and emit the corresponding uops. However, this burden is not a consequence of defining microarchitecture-dependent instructions, but rather of defining new instructions, irrespective of the functionality of these instructions.

(2) Alternatively, the length of the current instruction can be stored in a tiny register. IIRC, the maximum instruction length in the 8086 is 6 bytes, so it takes at most 3 bits to store the length of any instruction. This overhead is very small even for the 8086 days.