Out-of-order execution vs. speculative execution

2019-01-09 11:14发布

I have read the wikipedia page about out-of-order execution and speculative exectution.

What I fail to understant though are the similarities and differences. It seems to me that speculative execution uses out-of-order execution when it has not determined the value of a condition for example.

The confusion came when I read the papers of Meltdown and Spectre and did additional research. It is stated in the Meltdown paper that Meltdown is based on out-of-order execution, while some other resources including the wiki page about sepeculative execution state that Meltdown is based on speculative execution.

I'd like to get some clarification about this.

2条回答
做自己的国王
2楼-- · 2019-01-09 12:02

I'm still having hard time figuring out, how Meltdown uses speculative execution. The example in the paper (the same one I mentioned here earlier) uses IMO only OoO - @Name in a comment

Meltdown is based on Intel CPUs optimistically speculating that loads won't fault, and that if a faulting load reaches the load ports, that it was the result of an earlier mispredicted branch. So the load uop gets marked so it will fault if it reaches retirement, but execution continues speculatively using data the page table entry says you aren't allowed to read from user-space.

Instead of triggering a costly exception-recovery when the load executes, it waits until it definitely reaches retirement, because that's a cheap way for the machinery to handle the branch miss -> bad load case. In hardware, it's easier for the pipe to keep piping unless you need it to stop / stall for correctness. e.g. A load where there's no page-table entry at all, and thus a TLB miss, has to wait. But waiting even on a TLB hit (for an entry with permissions that block using it) would be added complexity. Normally a page-fault is only ever raised after a failed page walk (which doesn't find an entry for the virtual address), or at retirement of a load or store that failed the permissions of the TLB entry it hit.

In a modern OoO pipelined CPU, all instructions are treated as speculative until retirement. Only at retirement do instructions become non-speculative. The Out-of-Order machinery doesn't really know or care whether it's speculating down one side of a branch that was predicted but not executed yet, or speculating past potentially-faulting loads. "Speculating" that loads don't fault or ALU instructions don't raise exceptions happens even in CPUs that aren't really considered speculative, but fully out-of-order execution turns that into just another kind of speculation.

I'm not too worried about an exact definition for "speculative execution", and what counts / what doesn't. I'm more interested in how modern out-of-order designs actually work, and that it's actually simpler to not even try to distinguish speculative from non-speculative until the end of the pipeline. This answer isn't even trying to address simpler in-order pipelines with speculative instruction-fetch (based on branch prediction) but not execution, or anywhere in between that and full-blown Tomasulo's algorithm with a ROB + scheduler with OoO exec + in-order retirement for precise exceptions.

For example, only after retirement can a store ever commit from the store buffer to L1d cache, not before. And to absorb short bursts and cache misses, it doesn't have to happen as part of retirement either. So one of the only non-speculative out-of-order things is committing stores to L1d; they have definitely happened as far as the architectural state is concerned, so they have to be completed even if an interrupt / exception happens.

The fault-if-reaching-retirement mechanism is a good way to avoid expensive work in the shadow of a branch mispredict. It also gives the CPU the right architectural state (register values, etc.) if the exception does fire. You do need that whether or not you let the OoO machinery keep churning on instructions beyond a point where you've detected an exception.


Branch-misses are special: there are buffers that record micro-architectural state (like register-allocation) on branches, so branch-recovery can roll back to that instead of flushing the pipeline and restarting from the last known-good retirement state. Branches do mispredict a fair amount in real code. Other exceptions are very rare.

Modern high-performance CPUs can keep (out-of-order) executing uops from before a branch miss, while discarding uops and execution results from after that point. Fast recovery is a lot cheaper than discarding and restarting everything from a retirement state that's potentially far behind the point where the mispredict was discovered.

E.g. in a loop, the instructions that handle the loop counter might get far ahead of the rest of the loop body, and detect the mispredict at the end soon enough to redirect the front-end and maybe not lose much real throughput, especially if the bottleneck was the latency of a dependency chain or something other than uop throughput.

This optimized recovery mechanism is only used for branches (because the state-snapshot buffers are limited), which is why branch misses are relatively cheap compared to full pipeline flushes. (e.g. on Intel, memory-ordering machine clears, performance counter machine_clears.memory_ordering: What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings?)


Exceptions are not unheard-of, though; page-faults do happen in the normal course of operation. e.g. store to a read-only page triggers copy-on-write. Load or store to an unmapped page triggers page-in or handling the lazy mapping. But thousands to millions of instructions usually run between every page fault even in a process that's allocating new memory frequently. (1 per micro or milli-second on a 1GHz CPU). In code that doesn't map new memory, you can go far longer without exceptions. Mostly just a timer interrupt occasionally in pure number crunching without I/O.

But anyway, you don't want to trigger a pipeline flush or anything expensive until you're sure that an exception will really fire. And that you're sure you have the right exception. e.g. maybe the load address for an earlier faulting load wasn't ready as soon, so the first faulting load to execute wasn't the first in program order. Waiting until retirement is a cheap way to get precise exceptions. Cheap in terms of additional transistors to handle this case, and letting the usual in-order retirement machinery figure out exactly which exception fires is fast.

The useless work done executing instructions after an instruction marked to fault on retirement costs a tiny bit of power, and isn't worth blocking because exceptions are so rare.

This explains why it makes sense to design hardware that was vulnerable to Meltdown in the first place. Obviously it's not safe to keep doing this, now that Meltdown has been thought of.


Fixing Meltdown cheaply

We don't need to block speculative execution after a faulting load; we just need to make sure it doesn't actually use sensitive data. It's not the load succeeding speculatively that's the problem, Meltdown is based on the following instructions using that data to produce data-dependent microarchitectural effects. (e.g. touching a cache line based on the data).

So if the load ports mask the loaded data to zero or something as well as setting the fault-on-retirement flag, execution continues but can't gain any info about the secret data. This should take about 1 extra gate delay of critical path, which is probably possible in the load ports without limiting the clock speed or adding an extra cycle of latency. (1 clock cycle is long enough for logic to propagate through many AND/OR gates within a pipeline stage, e.g. a full 64-bit adder).

Related: I suggested the same mechanism for a HW fix for Meltdown in Why are AMD processors not/less vulnerable to Meltdown and Spectre?.

查看更多
孤傲高冷的网名
3楼-- · 2019-01-09 12:10

Welcome to the Quagmire

You'll find many articles that consider speculative execution as some kind of out-of-order execution (also known as dynamic execution or dynamic scheduling) as well. I could find this among the top search results:

Speculative execution is a technique CPU designers use to improve CPU performance. It’s one of three components of out-of-order execution, also known as dynamic execution. Along with multiple branch prediction (used to predict the instructions most likely to be needed in the near future) and dataflow analysis (used to align instructions for optimal execution, as opposed to executing them in the order they came in), speculative execution delivered a dramatic performance improvement over previous Intel processors.

This is not accurate.

Historical Perspective

There are many patent that were filed and/or granted in 60s and 70s on branch prediction and similar speculative techniques. I've noticed that all of them assumed pipelined execution, but none of them require out-of-order execution (most don't even mention it). By searching for a couple of minutes, I could find these:

Digital computer having high speed branch operation, 1966.

Instruction processing unit for program branches, 1966.

Look-ahead branch detection system, 1968.

Data processing apparatus, 1968. (note sure about this one).

Branch predicting computer, 1981

Decode history table for conditional branch instructions, 1982.

Relationship Between the Terms According to the US Patent Classification

According to the US Patent Classification, the class number G06F9/38 refers to "Concurrent instruction execution". This includes pipelining, dynamic execution, speculative execution, and anything similar to this kind of stuff. The subclass number G06F9/3836 refers to "Instruction issuing". This is specifically concurrent instruction scheduling techniques such as out-of-order execution. The subclass number G06F9/3842, which is a sibling to G06F9/3836, refers to speculative instruction execution. Hence, according to this classification, they are orthogonal. Considering that US patents rely on a very serious legal system, I think I can say that their classification is very precise and done very carefully.

Discussion

Speculative execution and out-of-order execution are orthogonal. One could design a processor this OoO but not speculative or speculative but in-order. I'll explain how. OoO execution is an execution model in which instructions can be executed in an order that is potentially different from the program order. However, the instructions are still retired in program order so that the program's observed behavior is the same as the one intuitively expected by the programmer. (Although, technically speaking, it's possible to design an OoO processor that retires instructions in some unnatural order with certain constraints).

Speculative execution is an execution model in which instructions can enter the pipeline and even begin execution without even knowing for sure that they will indeed be required to execute (according to the control flow of the program). In the broader definition of speculative execution, the processor is allowed to speculate on anything really, including the operands and the memory locations used and accessed by the instruction.

The Meltdown paper does define these terms on page 3:

In this paper, we refer to speculative execution in a more restricted meaning, where it refers to an instruction sequence following a branch, and use the term out-of-order execution to refer to any way of getting an operation executed before the processor has committed the results of all prior instructions.

Note that instructions can be executed speculatively, yet in-order. When the decoding stage of the pipeline identifies a conditional branch instruction, it can speculate on the branch and its target and fetch instructions from the predicted target location. But still, instructions can also be executed in-order. However, note that once the speculated conditional branch instruction and the instructions fetched from the predicted path (or both paths) reach the issue stage, none of them will be issued until all earlier instructions retire. When that happens, the processor would know whether the prediction was correct and flush the pipeline otherwise.

Processors designed to carry out simple tasks and used in embedded systems or IoT devices are typically neither speculative nor OoO. Desktop and server processors are both speculative and OoO. In the middle of the computing spectrum (mobile phones and microcontrollers), you can find processors that are OoO, but not speculative (such as the ARM Cortex-A9). Some Intel Atom processors are speculative, but in-order. Speculative execution is particularly beneficial when used with OoO.

The confusion came when I read the papers of Meltdown and Spectre and did additional research. It is stated in the Meltdown paper that Meltdown is based on out-of-order execution, while some other resources including the wiki page about sepeculative execution state that Meltdown is based on speculative execution.

The Meltdown vulnerability as described in the paper requires both speculative and out-of-order execution. However, this is really a vague statement since there are many different speculative and out-of-order execution implementations. Meltdown doesn't work with just any type of OoO or speculative execution. For example, ARM11 (used in Raspberry Pis) supports some limited OoO and speculative execution, but it's not vulnerable.

Related: What is the difference between Superscalar and OoO execution?.

查看更多
登录 后发表回答