I'm trying to understand the RIDL class of vulnerability.
This is a class of vulnerabilities that is able to read stale data from various micro-architectural buffers.
Today the known vulnerabilities exploits: the LFBs, the load ports, the eMC and the store buffer.
The paper linked is mainly focused on LFBs.
I don't understand why the CPU would satisfy a load with the stale data in an LFB.
I can imagine that if a load hits in L1d it is internally "replayed" until the L1d brings data into an LFB signalling the OoO core to stop "replaying" it (since the data read are now valid).
However I'm not sure what "replay" actually mean.
I thought loads were dispatched to a load capable port and then recorded in the Load Buffer (in the MOB) and there being eventually hold as needed until their data is available (as signalled by the L1).
So I'm not sure how "replaying" comes into play, furthermore for the RIDL to work, each attempt to "play" a load should also unblock dependent instructions.
This seems weird to me as the CPU would need to keep track of which instructions to replay after the load correctly completes.
The paper on RIDL use this code as an example (unfortunately I had to paste it as an image since the PDF layout didn't allow me to copy it):
The only reason it could work is if the CPU will first satisfy the load at line 6 with a stale data and then replay it.
This seems confirmed few lines below:
Specifically, we may expect two
accesses to be fast, not just the one corresponding to the
leaked information. After all, when the processor discovers
its mistake and restarts at Line 6 with the right value, the
program will also access the buffer with this index.
But I would expect the CPU to check the address of the load before forwarding the data in the LFB (or any other internal buffer).
Unless the CPU actually executes the load repeatedly until it detect the data loaded is now valid (i.e. replaying).
But, again, why each attempt would unblock dependent instructions?
How does exactly the replaying mechanism work, if it even exists, and how this interacts with the RIDL vulnerabilities?
I don't think load replays from the RS are involved in the RIDL attacks. So instead of explaining what load replays are (@Peter's answer is a good starting point for that), I'll discuss what I think is happening based on my understanding of the information provided in the RIDL paper, Intel's analysis of these vulnerabilities, and relevant patents.
Line fill buffers are hardware structures in the L1D cache used to hold memory requests that miss in the cache and I/O requests until they get serviced. A cacheable request is serviced when the required cache line is filled into the L1D data array. A write-combining write is serviced when the any of the conditions for evicting a write-combining buffer occur (as described in the manual). A UC or I/O request is serviced when it is sent to the L2 cache (which occurs as soon as possible).
Refer to Figure 4 of the RIDL paper. The experiment used to produce these results works as follows:
- The victim thread writes a known value to a single memory location. The memory type of the memory location is WB, WT, WC, or UC.
- The victim thread reads the same memory location in a loop. Each load operation is followed by
MFENCE
and there is an optional CLFLUSH
. It's not clear to me from the paper the order of CLFLUSH
with respect to the other two instructions, but it probably doesn't matter. MFENCE
serializes the cache line flushing operation to see what happens when every load misses in the cache. In addition, MFENCE
reduces contention between the two logical cores on the L1D ports, which improves the throughput of the attacker.
- An attacker thread running on a sibling logical core executes the code shown in Listing 1 in a loop. The address used at Line 6 can be anything. The only thing that matters is that load at Line 6 either faults or causes a page walk that requires an microcode assist (to set the accessed bit in the page table entry). A page walk requires using the LFBs as well and most of the LFBs are shared between the logical cores.
It's not clear to me what the Y-axis in Figure 4 represents. My understanding is that it represents the number of lines from the covert channel that got fetched into the cache hierarchy (Line 10) per second, where the index of the line in the array is equal to the value written by the victim.
If the memory location is of the WB type, when the victim thread writes the known value to the memory location, the line will be filled into the L1D cache. If the memory location is of the WT type, when the victim thread writes the known value to the memory location, the line will not be filled into the L1D cache. However, on the first read from the line, it will be filled. So in both cases and without CLFLUSH
, most loads from the victim thread will hit in the cache.
When the cache line for a load request reaches the L1D cache, it gets written first in the LFB allocated for the request. The requested portion of the cache line can be directly supplied to the load buffer from the LFB without having to wait for the line to be filled in the cache. According to the description of the MFBDS vulnerability, under certain situations, stale data from previous requests may be forwarded to the load buffer to satisfy a load uop. In the WB and WT cases (without flushing), the victim's data is written into at most 2 different LFBs. The page walks from the attacker thread can easily overwrite the victim's data in the LFBs, after which the data will never be found in there by the attacker thread. All load requests that hit in the L1D cache don't go through the LFBs; there is a separate path for them, which is multiplexed with the path from the LFBs. Nonetheless, there are some cases where stale data (noise) from the LFBs is being speculatively forwarded to the attacker's logical core, which is probably from the page walks (and maybe interrupt handlers and hardware prefetchers).
It's interesting to note that the frequency of stale data forwarding in the WB and WT cases is much lower than in all of the other cases. This is could be explained by the fact that the victim's throughput is much higher in these cases and the experiment may terminate earlier.
In all other cases (WC, UC, and all types with flushing), every load misses in the cache and the data has to be fetched from main memory to the load buffer through the LFBs. The following sequence of events occur:
- The accesses from the victim hit in the TLB because they are to the same valid virtual page. The physical address is obtained from the TLB and provided to the L1D, which allocates an LFB for the request (due to a miss) and the physical address is written into the LFB together with other information that describes the load request. At this point, the request from the victim is pending in the LFB. Since the victim executes an
MFENCE
after every load, there can be at most one outstanding load in the LFB at any given cycle from the victim.
- The attacker, running on the sibling logical core, issues a load request to the L1D and the TLB. Each load is to an unmapped user page, so it will cause a fault. When the it misses in the TLB, the MMU tells the load buffer that the load should be blocked until the address translation is complete. According to paragraph 26 of the patent and other Intel patents, that's how TLB misses are handled. The address translation is still in progress the load is blocked.
- The load request from the victim receives its cache line, which gets written into the LFB allcoated for the load. The part of the line requested by the load is forwarded to the MOB and, at the same time, the line is written into the L1D cache. After that, the LFB can be deallcoated, but none of the fields are cleared (except for the field that indicates that its free). In particular, the data is still in the LFB. The victim then sends another load request, which also misses in the cache either because it is uncacheable or because the cache line has been flushed.
- The address translation process of the attacker's load completes. The MMU determines that a fault needs to be raised because the physical page is not present. However, the fault is not raised until the load is about retire (when it reaches the top of the ROB). Invalid translations are not cached in the MMU on Intel processors. The MMU still has to tell the MOB that the translation has completed and, in this case, sets a faulting code in the corresponding entry in the ROB. It seems that when the ROB sees that one of the uops has valid fault/assist code, it disables all checks related to sizes and addresses of that uops (and possibly all later uops in the ROB). These checks don't matter anymore. Presumably, disabling these checks saves dynamic energy consumption. The retirement logic knows that when the load is about to retire, a fault will be raised anyway. At the same time, when the MOB is informed that the translation is completed, it replays the attacker's load, as usual. This time, however, some invalid physical address is provided to the L1D cache. Normally, the physical address needs to compared against all requests pending in the LFBs from the same logical core to ensure that the logical core sees the most recent values. This is done before or in parallel with looking up the L1D cache. The physical address doesn't really matter because the comparison logic is disabled. However, the results of all comparisons behave as if the result indicates success. If there is at least one allocated LFB, the physical address will match some allocated LFB. Since there is an outstanding request from the victim and since the victim's secret may have already been written in the same LFB from previous requests, the same part of the cache line, which technically contains stale data and in this case (the stale data is the secret), will be forwarded to the attacker. Note that the attacker has control over the offset within a cache line and the number of bytes to get, but it cannot control which LFB. The size of a cache line is 64 bytes, so only the 6 least significant bits of the virtual address of the attacker's load matter, together with the size of the load. The attacker then uses the data to index into its array to reveal the secret using a cache side channel attack. This behavior would also explain MSBDS, where apparently the data size and STD uop checks are disabled (i.e, the checks trivially pass).
- Later, the faulting/assisting load reaches the top of the ROB. The load is not retired and the pipeline is flushed. In case of faulting load, a fault is raised. In case of an assisting load, execution is restarted from the same load instruction, but with an assist to set the required flags in the paging structures.
- These steps are repeated. But the attacker may not always be able to leak the secret from the victim. As you can see, it has to happen that the load request from the attacker hits an allocated LFB entry that contains the secret. LFBs allocated for page walks and hardware prefetchers may make it harder to perform a successful attack.
If the attacker's load didn't fault/assist, the LFBs will receive a valid physical address from the MMU and all checks required for correctness are performed. That's why the load has to fault/assist.
The following quote from the paper discusses how to perform a RIDL attack in the same thread:
we perform the RIDL attack without SMT by writing values in our own
thread and observing the values that we leak from the same thread.
Figure3 shows that if we do not write the values (“no victim”), we leak
only zeros, but with victim and attacker running in the same hardware
thread (e.g., in a sandbox), we leak the secret value in almost all
cases.
I think there are no privilege level changes in this experiment. The victim and the attacker run in the same OS thread on the same hardware thread. When returning from the victim to the attacker, there may still be some outstanding requests in the LFBs from (especially from stores). Note that in the RIDL paper, KPTI is enabled in all experiments (in contrast to the Fallout paper).
In addition to leaking data from LFBs, MLPDS shows that data can also be leaked from the load port buffers. These include the line-split buffers and the buffers used for loads larger than 8 bytes in size (which I think are needed when the size of the load uop is larger than the size of the load port, e.g., AVX 256b on SnB/IvB that occupy the port for 2 cycles).
The WB case (no flushing) from Figure 5 is also interesting. In this experiment, the victim thread writes 4 different values to 4 different cache lines instead of reading from the same cache line. The figure shows that, in the WB case, only the data written to the last cache line is leaked to the attacker. The explanation may depend on whether the cache lines are different in different iterations of the loop, which is unfortunately not clear in the paper. The paper says:
For WB without flushing, there is a signal only for the last cache
line, which suggests that the CPU performs write combining in a single
entry of the LFB before storing the data in the cache.
How can writes to different cache lines be combining in the same LFB before storing the data in the cache? That makes zero sense. An LFB can hold a single cache line and and a single physical address. It's just not possible to combine writes like that. What may be happening is that WB writes are being written in the LFBs allocated for their RFO requests. When the invalid physical address is transmitted to the LFBs for comparison, the data may always be provided from the LFB that was last allocated. This would explain why only the value written by the fourth store is leaked.
For information on MDS mitigations, see: What are the new MDS attacks, and how can they be mitigated?. My answer there only discusses mitigations based on the Intel microcode update (not the very interesting "software sequences").
The following figure shows the vulnerable structures that use data speculation.
replay = being dispatched again from the RS (scheduler). (This isn't a complete answer to your whole question, just to the part about what replays are. Although I think this covers most of it, including unblocking dependent uops.)
It turns out that a cache-miss load doesn't just sit around in a load buffer and wake up dependent uops when the data arrives. The scheduler has to re-dispatch the load uop to actually read the data and write-back to a physical register. (And put it on the forwarding network where dependent uops can read it in the next cycle.)
So L1 miss / L2 hit will result in 2x as many load uops dispatched. (The scheduler is optimistic, and L2 is on-core so the expected latency of an L2 hit is fixed, unlike time for an off-core response. IDK if the scheduler continues to be optimistic about data arriving at a certain time from L3.)
The RIDL paper provides some interesting evidence that load uops do actually directly interact with LFBs, not waiting for incoming data to be placed in L1d and just reading it from there.
We can observe replays in practice most easily for cache-line-split loads, because causing that repeatedly is even more trivial than cache misses, taking less code. The counts for uops_dispatched_port.port_2
and port_3
will be about twice as high for a loop that does only split loads. (I've verified this in practice on Skylake, using essentially the same loop and testing procedure as in How can I accurately benchmark unaligned access speed on x86_64)
Instead of signalling successful completion back to the RS, a load that detects a split (only possible after address-calculation) will do the load for the first part of the data, putting this result in a split buffer1 to be joined with the data from the 2nd cache line the 2nd time the uop dispatches. (Assuming that neither time is a cache miss, otherwise it will take replays for that, too.)
When a load uop dispatches, the scheduler anticipates it will hit in L1d and dispatches dependent uops so they can read the result from the forwarding network in the cycle the load puts them on that bus.
If that didn't happen (because the load data wasn't ready), the dependent uops will have to be replayed as well. Again, IIRC this is observable with the perf counters for dispatch
to ports.
Existing Q&As with evidence of uop replays on Intel CPUs:
- Why does the number of uops per iteration increase with the stride of streaming loads?
- Weird performance effects from nearby dependent stores in a pointer-chasing loop on IvyBridge. Adding an extra load speeds it up?
- How can I accurately benchmark unaligned access speed on x86_64 and Is there a penalty when base+offset is in a different page than the base?
- Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths points out that the possibility of replay mean the RS needs to hold on to a uop until an execution unit signals successful completion back to the RS. It can't drop a uop on first dispatch (like I guessed when I first wrote that answer).
Footnote 1:
We know there are a limited number of split buffers; there's a ld_blocks.no_sr
counter for loads that stall for lack of one. I infer they're in the load port because that makes sense. Re-dispatching the same load uop will send it to the same load port because uops are assigned to ports at issue/rename time. Although maybe there's a shared pool of split buffers.
RIDL:
Optimistic scheduling is part of the mechanism that creates a problem. The more obvious problem is letting execution of later uops see a "garbage" internal value from an LFB, like in Meltdown.
http://blog.stuffedcow.net/2018/05/meltdown-microarchitecture/ even shows that meltdown loads in PPro expose various bits of microarchitectural state, exactly like this vulnerability that still exists in the latest processors.
The Pentium Pro takes the “load value is a don’t-care” quite literally. For all of the forbidden loads, the load unit completes and produces a value, and that value appears to be various values taken from various parts of the processor. The value varies and can be non-deterministic. None of the returned values appear to be the memory data, so the Pentium Pro does not appear to be vulnerable to Meltdown.
The recognizable values include the PTE for the load (which, at least in recent years, is itself considered privileged information), the 12th-most-recent stored value (the store queue has 12 entries), and rarely, a segment descriptor from somewhere.
(Later CPUs, starting with Core 2, expose the value from L1d cache; this is the Meltdown vulnerability itself. But PPro / PII / PIII isn't vulnerable to Meltdown. It apparently is vulnerable to RIDL attacks in that case instead.)
So it's the same Intel design philosophy that's exposing bits of microarchitectural state to speculative execution.
Squashing that to 0 in hardware should be an easy fix; the load port already knows it wasn't successful so masking the load data according to success/fail should hopefully only add a couple extra gate delays, and be possible without limiting clock speed. (Unless the last pipeline stage in the load port was already the critical path for CPU frequency.)
So probably an easy and cheap fix in hardware for future CPU, but very hard to mitigate with microcode and software for existing CPUs.