I have done some reading about Spectre v2 and obviously you get the non-technical explanations. Peter Cordes has a more in-depth explanation but it doesn't fully address a few details. Note: I have never performed a Spectre v2 attack so I do not have hands-on experience. I have only read up on the theory.
My understanding of Spectre v2 is that you make an indirect branch mispredict, for instance if (input < data.size). If the Indirect Target Array (which I'm not too sure of the details of, i.e. why it is separate from the BTB structure), which is checked at decode for the RIPs of indirect branches, does not contain a prediction, then it will insert the new jump RIP (branch execution will eventually insert the target RIP of the branch), but for now it does not know the target RIP of the jump, so any form of static prediction will not work. My understanding is that it is always going to predict not taken for new indirect branches, and when port 6 eventually works out the jump target RIP and prediction, it will roll back using the BOB and update the ITA with the correct jump address and then update the local and global branch history registers and the saturating counters accordingly.
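The saturating-counter behaviour described above can be modelled in a few lines. This is a Python sketch of a generic 2-bit counter, not Intel's actual implementation; the real predictor's indexing and update rules are undocumented:

```python
# Model of a 2-bit saturating counter: states 0-1 predict not-taken, 2-3 predict taken.
class SaturatingCounter:
    def __init__(self, state=0):          # new branches start at strongly not-taken
        self.state = state

    def predict(self):
        return self.state >= 2            # True = predict taken

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

c = SaturatingCounter()
for _ in range(3):                        # train with taken outcomes
    c.update(True)
assert c.predict()                        # now predicts taken
c.update(False)                           # a single not-taken outcome...
assert c.predict()                        # ...does not flip a saturated counter
```

The hysteresis in the last two lines is the property the attacker relies on: one mispredicting iteration doesn't untrain the counter.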
The hacker needs to train the saturating counters to always predict taken, which, I imagine, they do by running the if (input < data.size) multiple times in a loop where input is set to something that is indeed less than data.size (catching errors accordingly), and on the final iteration of the loop, make input more than data.size (1000 for instance); the indirect branch will be predicted taken and it will jump to the body of the if statement where the cache load takes place.
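In outline, the training loop described above might look like the following. The victim function, array size, and input values here are hypothetical; in a real attack it is the final out-of-bounds call that triggers the mispredicted speculation:

```python
ARRAY_SIZE = 16
data = list(range(ARRAY_SIZE))            # stands in for data with data.size == 16

def victim(input):
    # the branch being trained: architecturally, out-of-bounds inputs take the else path
    if input < ARRAY_SIZE:
        return data[input]
    return -1                             # error path ("catching errors accordingly")

# prime the predictor: many in-bounds calls so the counters saturate at "taken"...
for _ in range(100):
    victim(5)

# ...then one out-of-bounds call; architecturally it returns -1, but a trained
# predictor would speculatively run the in-bounds body with input == 1000
assert victim(1000) == -1
```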
The if statement contains secret = data[1000] (a particular memory address (data[1000]) that contains secret data is targeted for loading from memory to cache), and this will be allocated to the load buffer speculatively. The preceding indirect branch is still in the branch execution unit and waiting to complete.
I believe the premise is that the load needs to be executed (assigned a line fill buffer) before the load buffers are flushed on the misprediction. If it has been assigned a line fill buffer already then nothing can be done. It makes sense that there isn't a mechanism to cancel a line fill buffer allocation, because the line fill buffer would have to pend before storing to the cache after returning the data to the load buffer. This could cause line fill buffers to become saturated, because instead of deallocating when required (keeping an entry around for speed of other loads to the same address, but deallocating when there are no other available line buffers), a buffer would not be able to deallocate until it receives some signal that a flush is not going to occur, meaning it has to wait for the previous branch to execute instead of immediately making the line fill buffer available for the stores of the other logical core. This signalling mechanism could be difficult to implement, and perhaps it didn't cross their minds (pre-Spectre thinking); it would also introduce delay in the event that branch execution takes long enough for hanging line fill buffers to cause a performance impact, i.e. if data.size is purposefully flushed from the cache (CLFLUSH) before the final iteration of the loop, meaning branch execution could take up to 100 cycles.
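A toy model of why flushing data.size widens the speculation window. The latency numbers are made up for illustration only; real miss latencies vary by part and memory configuration:

```python
L1_HIT_CYCLES, DRAM_MISS_CYCLES = 4, 100   # illustrative latencies only

def branch_resolve_latency(size_is_cached):
    # the compare 'input < data.size' cannot resolve until data.size arrives,
    # so the branch stays unresolved for the full load latency
    return L1_HIT_CYCLES if size_is_cached else DRAM_MISS_CYCLES

# flushing data.size (CLFLUSH) stretches the window in which the mispredicted
# path can issue its load and be assigned a line fill buffer
assert branch_resolve_latency(False) > branch_resolve_latency(True)
```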
I hope my thinking is correct but I'm not 100% sure. If anyone has anything to add or correct then please do.
Sometimes the term "BTB" is used collectively to refer to all of the buffers used by the branch prediction unit. However, there are actually multiple buffers, all of which are used in every cycle to make target and direction predictions. In particular, the BTB is used to make predictions for direct branches, the ITB (indirect target buffer) is used to make predictions for indirect branches except for returns, and the RSB is used to make predictions for returns. The ITB is also called the IBTB or the Indirect Target Array. All of these terms are used by different vendors and researchers. Typically, the BTB is used to make initial predictions for all kinds of branch instructions when the other buffers miss, but later the predictor learns more about the branches and the other buffers come into play. If multiple dynamic instances of the same indirect branch all have the same target, then the BTB might also be used instead of the ITB. The ITB is much more accurate when the same branch has multiple targets and is designed specifically to deal with such branches. See: Branch prediction and the performance of interpreters — Don't trust folklore. The first Intel processor that implemented separate BTB and ITB structures is the Pentium M. All later Intel Core processors have dedicated ITBs.
The Spectre V1 exploit is based on mistraining the branch direction predictor so that when the victim executes a branch that aliases the same predictor entry, the processor is tricked into speculatively executing instructions (called the gadget) to leak information. The Spectre V2 exploit is similar but is based on training the ITB instead. The crucial difference here is that in V1, the processor mispredicts the direction of the branch, while in V2, the processor mispredicts the target of the branch (and, in the case of a conditional indirect branch, the direction as well, because we want it to be taken). In programs that are interpreted, JIT-compiled, or make use of dynamic polymorphism, there can be many indirect branches (other than returns). A particular indirect branch may never be intended to go to some location, but by mistraining the predictor, it can be made to jump anywhere we want. It is exactly for this reason that V2 is very powerful; no matter where the gadget is and no matter what the intentional control flows of the program are, you can pick one of the indirect branches and make it speculatively jump to the gadget.
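A sketch of the kind of indirect branch V2 targets: an interpreter-style dispatch through a table, with a simplified last-target predictor standing in for the ITB. The addresses, handler names, and predictor policy are all illustrative assumptions, not the real hardware design:

```python
def op_add(x): return x + 1
def op_neg(x): return -x
def gadget(x): return "leak"              # stands in for the attacker's chosen target

handlers = {"add": op_add, "neg": op_neg}  # the victim's intended dispatch table

class LastTargetPredictor:
    """Toy model: predicts an indirect branch goes where it went last time."""
    def __init__(self):
        self.last = {}
    def predict(self, branch_pc):
        return self.last.get(branch_pc)    # None = no prediction yet
    def update(self, branch_pc, target):
        self.last[branch_pc] = target

p = LastTargetPredictor()
BRANCH_PC = 0x401000                       # hypothetical address of the indirect call

# the attacker repeatedly drives the branch to the gadget...
for _ in range(3):
    p.update(BRANCH_PC, gadget)

# ...so the next prediction for that PC is the gadget, regardless of what the
# victim's handler table says the architectural target should be
assert p.predict(BRANCH_PC) is gadget
assert handlers["add"](1) == 2             # architecturally, dispatch is still correct
```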
Note that typically the linear address of the target of a static direct branch remains the same throughout the lifetime of the program. There is only one situation where this may not be the case: dynamic code modification. So at least in theory, a Spectre exploit can be developed based on target misprediction of direct branches.
Regarding reclamation of LFBs, I don't really understand what you're saying. When a load request that missed the L1D receives the data into the LFB, the data is immediately forwarded to the bypass interconnect of the pipeline. There needs to be a way to determine which load uop has requested this data, so the data returned must be tagged with the uop ID of the load. The sources of the uops in the RS that are waiting for the data are represented as the uop IDs of the loads. In addition, the ROB entry that holds the load uop needs to be marked as completed so that it can be retired, and, in pre-SnB, the returned data needs to be written into the ROB. If, on a pipeline flush, an outstanding load request in an LFB is not cancelled, and if the load uop ID got reused for some other uop, then when the data arrives, it might be incorrectly forwarded to whatever new uops are currently in the pipeline, thereby corrupting the microarchitectural state. So there needs to be a way to ensure that this does not happen under any circumstances. It's very possible to cancel outstanding load requests and speculative RFOs on a pipeline flush by simply marking all of the valid LFB entries as "cancelled", just so that the data is not returned to the pipeline. However, the data might still be fetched and filled into one or more levels of caches. Requests in the LFB are identified by line-aligned physical addresses. There can be other possible designs.
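The "mark entries cancelled on flush" idea from the last paragraph can be modelled like this. Pure illustration: real LFB bookkeeping is undocumented, and the field names here are invented:

```python
class LineFillBuffer:
    def __init__(self):
        self.entries = {}                 # line address -> requesting uop id

    def allocate(self, line_addr, uop_id):
        self.entries[line_addr] = uop_id

    def flush_pipeline(self):
        # mark every outstanding request cancelled instead of tearing it down:
        # the fetch can still fill the cache, but no data goes back to the pipeline
        for addr in self.entries:
            self.entries[addr] = None     # None == cancelled

    def data_returned(self, line_addr, cache):
        uop = self.entries.pop(line_addr)
        cache.add(line_addr)              # the cache is filled either way
        return uop                        # None means: do not forward to any uop

lfb, cache = LineFillBuffer(), set()
lfb.allocate(0x1000, uop_id=7)            # speculative load on the wrong path
lfb.flush_pipeline()                      # misprediction: uop id 7 may be reused
assert lfb.data_returned(0x1000, cache) is None   # stale data is not forwarded
assert 0x1000 in cache                    # ...but the line was still cached (the side channel)
```

The last assertion is exactly the Spectre observation: cancelling the forwarding protects architectural correctness, yet the cache fill still happens.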
I've decided to run an experiment to determine exactly when the LFBs get deallocated on Haswell. Here is how it works:
For this to work, hyperthreading and both L1 prefetchers need to be turned off to ensure that we own all of the 10 LFBs of the L1.
The LFENCE instructions ensure that we don't run out of LFBs when executing on a correctly predicted path. The key idea here is that the inner jump will be mispredicted once per outer iteration, so up to 10 loads of the inner iteration that are on the mispredicted path can be allocated in the LFBs. Note that the LFENCE prevents loads from later iterations from being allocated. After a few cycles, the inner branch will be resolved and a misprediction occurs. The pipeline is cleared and the frontend is resteered to fetch and execute the load instructions in the outer loop. There are two possible outcomes: either the LFBs get deallocated as part of the pipeline flush, in which case the loads on the correct path find free entries and there are essentially no FB full events, or the LFBs only get deallocated when the pending loads complete, in which case the correct-path loads find them occupied (which is counted by L1D_PEND_MISS.FB_FULL).

When there are three loads in the outer loop after the inner jump, the measured value of
L1D_PEND_MISS.FB_FULL is about equal to the number of outer iterations. That's one request per outer loop iteration. This means that when the three loads on the correct path get issued to the L1D, the loads from the mispredicted path are still occupying the 8 LFB entries, resulting in an FB full event for the third load. This suggests that loads in the LFBs only get deallocated when the load actually completes. If I put fewer than three loads in the outer loop, there will be basically no FB full events. There is one thing I noticed: for every additional load in the outer loop beyond three loads, the L1D_PEND_MISS.FB_FULL count increases by about 20K instead of the expected 10K. I think what's happening is that when a load request of a load uop gets issued to the L1D for the first time and all the LFBs are in use, it gets rejected. Then when an LFB becomes available, two loads pending in the load buffer are sent to the L1D; one will be allocated in the LFB and the other will get rejected. So we get two LFB full events per additional load. However, when there are three loads in the outer loop, only the third one would be waiting for an LFB, so we get one event per outer loop iteration. Essentially, the load buffer cannot distinguish between having one LFB available or two; it only gets to know that at least one LFB is free, and so it attempts to send two load requests at the same time since there are two load ports.

Thanks Brendan and Hadi Brais, after reading your answers and finally reading the Spectre paper it is clear now where I was going wrong in my thinking and I confused the two a little.
I was partially describing Spectre v1, which causes a bounds check bypass by mistraining the branch history of a jump, i.e. if (x < array1_size), to reach a Spectre gadget. This is obviously not an indirect branch. The hacker does this by invoking a function containing the Spectre gadget with legal parameters to prime the branch predictor (PHT+BHT) and then invoking it with illegal parameters to bring array1[x] into the cache. They then reprime the branch history by supplying legal parameters and then flush array1_size from the cache (which I'm not sure how they do, because even if the attacker process knows the VA of array1_size, the line cannot be flushed because the TLB contains a different PCID for the process, so it must be caused to be evicted in some other way, i.e. by filling the set at that virtual address). They then invoke with the same illegal parameters as before, and as array1[x] is in the cache but array1_size is not, array1[x] will resolve quickly and begin the load of array2[array1[x]] while still waiting on array1_size, which loads a position in array2 based on the secret at any x that transcends the bounds of array1. The attacker then recalls the function with a valid value of x and times the function call (I assume the attacker must know the contents of array1, because if array2[array1[8]] results in a faster access they need to know what is at array1[8], as that is the secret; but surely that array would have to contain every possible byte value (2^8 combinations), right?).

Spectre v2 on the other hand requires a second attack process that knows the virtual address of an indirect branch in the victim process so that it can poison the target and replace it with another address. If the attack process contains a jump instruction that would reside in the same set, way and tag in the IBTB as the victim's indirect branch, then it just trains that branch instruction to predict taken and jump to a virtual address which happens to be that of the gadget in the victim process. When the victim process encounters the indirect branch, the wrong target address from the attack program is in the IBTB. It is crucial that it is an indirect branch, because mismatches resulting from a process switch are usually checked at decode, i.e. if the branch target differs from the target in the BTB for that RIP then the instructions fetched beyond it are flushed. This cannot be done with indirect branches because the target is not known until the execution stage, and hence the idea is that the indirect branch selected depends on a value that needs to be fetched from cache. It then jumps to this target address, which is that of the gadget, and so on and so forth.
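The cross-process poisoning above relies on the attacker's branch and the victim's branch indexing the same IBTB entry. A toy model of that aliasing, with a made-up index function and size (the real index hash had to be reverse-engineered for the actual attacks):

```python
IBTB_ENTRIES = 512                        # made-up structure size

def ibtb_index(branch_va):
    # toy index function: a few low bits of the branch address; the real
    # hardware hash is undocumented
    return (branch_va >> 4) % IBTB_ENTRIES

ibtb = {}                                 # index -> predicted target VA

VICTIM_BRANCH = 0x7FF610004430            # hypothetical addresses
GADGET        = 0x7FF62000BE00
attacker_branch = VICTIM_BRANCH           # same VA, but in the attacker's own address space

# the attacker trains its aliasing branch to jump to the gadget's VA...
ibtb[ibtb_index(attacker_branch)] = GADGET

# ...so when the victim looks up its own branch, the same index is used and
# the poisoned target comes back
assert ibtb[ibtb_index(VICTIM_BRANCH)] == GADGET
```

The model also shows why indexing by virtual address (without a full tag or process ID) is what makes the cross-process injection possible at all.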
The attacker needs to know the source code of the victim process to identify a gadget, and they need to know the VA at which it will reside. I assume this could be done by knowing predictably where code will be loaded. I believe .exes are typically loaded at 0x00400000, for instance, and then there is a BaseOfCode field in the PE header.
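For a non-ASLR image the arithmetic is simple. The offsets below are hypothetical values for illustration, not taken from any real binary:

```python
IMAGE_BASE   = 0x00400000                 # typical legacy .exe base when ASLR is off
BASE_OF_CODE = 0x1000                     # RVA of the code section, from the PE optional header
GADGET_OFFSET_IN_TEXT = 0x2F10            # hypothetical offset of the gadget within the section

# the gadget's virtual address is just base + section RVA + offset
gadget_va = IMAGE_BASE + BASE_OF_CODE + GADGET_OFFSET_IN_TEXT
assert gadget_va == 0x403F10
```

With ASLR enabled, IMAGE_BASE must instead be leaked or inferred at run time; the per-section arithmetic stays the same.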
Edit: I just read Appendix B of the spectre paper and it makes for a nice Windows implementation of Spectre v2.
It uses ntdll.dll (a .dll full of native API system call stubs) and kernel32.dll (the Windows API), which are always mapped in the user virtual address space at locations chosen by ASLR (as specified in the .dll images), except the physical address is likely to be the same due to copy-on-write view mapping into the page cache. The indirect branch to poison will be in the Windows API Sleep() function in kernel32.dll, which appears to indirectly call NtDelayExecution() in ntdll.dll. The attacker then ascertains the address of the indirect branch instruction, maps a page encompassing the victim address that contains the target address into its own address space, and changes the target address stored at that address to that of the gadget that they identified to be residing somewhere in the same or another function in ntdll.dll (I'm not entirely sure (due to ASLR) how the attacker knows for certain where the victim process maps kernel32.dll and ntdll.dll in its address space in order to locate the address of the indirect branch in Sleep() for the victim. Appendix B claims they used 'simple pointer operations' to locate the indirect branch and the address that contains the target; how that works I'm not sure). Threads are then launched with the same affinity as the victim (so that the victim and mistraining threads hyperthread on the same physical core) which call Sleep() themselves to train it indirectly, which in the address space context of the hack process will now jump to the address of the gadget. The gadget is temporarily replaced with a ret so that it returns from Sleep() smoothly. These threads will also execute a sequence before the indirect jump to mimic what the global branch history of the victim would be before encountering the indirect jump, to fully ensure that the branch is taken in an alloyed history. A separate thread is then launched with the complement of the thread affinity of the victim that repeatedly evicts the victim's memory address containing the jump destination, to ensure that when the victim does encounter the indirect branch it will take a long RAM access to resolve, which allows the gadget to speculate ahead before the branch destination can be checked against the BTB entry and the pipeline flushed. In JavaScript, eviction is done by loading from addresses in the same cache set, i.e. at multiples of 4096. The mistraining threads, eviction threads and victim threads are all running and looping at this stage. When the victim process loop calls Sleep(), the indirect branch speculates to the gadget due to the IBTB entry that the hacker poisoned previously. A probing thread is launched with the complement of the victim process thread affinity (so as not to interfere with the mistraining and victim branch history). The probing thread will modify the header of the file that the victim process uses, which results in those values residing in ebx and edi when Sleep() is called, meaning the probing thread can directly influence the values stored in ebx and edi. The Spectre gadget branched to in the example adds the value stored at [ebx+edx+13BE13BDh] to edi and then loads a value at the address stored in edi and adds it with a carry to dl. This allows the probing thread to learn the value stored at [ebx+edx+13BE13BDh], as if it selects an original edi of 0 then the value accessed in the second operation will be loaded from the virtual address range 0x0–0xFF, by which time the indirect branch will resolve, but the side effects are already present. The attack process needs to ensure that it has mapped the same physical address into the same location in its virtual address space in order to probe the probing array with a timing attack. I'm not sure how it does this, but in Windows it would, AFAIK, need to map a view of a page-file–backed section object that has been opened by the victim at that location. Either that, or it would manipulate the victim into calling the Spectre gadget with a negative TC ebx value such that ebx+edx+13BE13BDh = 0, = 1, ..., = 255
and somehow time that call. This could potentially also be achieved by using APC injection.

For branches, some are like jc .somewhere, where the CPU only really needs to guess whether the branch will be taken or not taken to be able to speculate down the guessed path. However, some branches are like jmp [table+eax*8], where there can be over 4 billion possible directions, and for those cases the CPU needs to guess the target address to be able to speculate down the guessed path. Because there are very different types of branches, the CPU uses very different types of predictors.

For Spectre, there's a "meta pattern": the attacker uses speculative execution to trick the CPU into leaving information in something, then extracts that information from the something. There are multiple possibilities for "something" (data caches, instruction caches, TLBs, branch target buffer, branch direction buffer, return stack, write-combining buffers, ...) and therefore there are many possible variations of Spectre (and not just the "well known first two variations" that were made public in early 2018).
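The two branch flavours above can be illustrated side by side, with Python standing in for the machine code. The dict dispatch in the second function is the kind of construct that compiles down to something like jmp [table+eax*8]:

```python
def conditional(x):
    # like 'jc .somewhere': two possible paths, the predictor only guesses taken/not-taken
    if x < 0:
        return "negative"
    return "non-negative"

def indirect(opcode, x):
    # like 'jmp [table+eax*8]': many possible targets, the predictor must guess an address
    table = {0: lambda v: v + 1, 1: lambda v: v * 2, 2: lambda v: -v}
    return table[opcode](x)

assert conditional(-5) == "negative"
assert indirect(1, 21) == 42
```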
For Spectre v1 (where "something" is a data cache), the attacker needs some way to trick the CPU into putting data into the data cache (e.g. a load and then a second load that depends on the value from the first load, which can be executed speculatively) and some way to extract the information (flush everything in the cache, then use the amount of time that a load takes to determine how the state of the data cache changed).
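The extract step can be simulated with the cache modelled as a set of line addresses, hit/miss standing in for fast/slow loads. Real attacks measure with timers like rdtscp; everything here (line size, addresses, the secret byte) is illustrative:

```python
LINE = 64                                 # cache line size in bytes

def run_victim_speculatively(secret, probe_base, cache):
    # the speculative dependent load touches exactly one line of the probe array
    cache.add(probe_base + secret * LINE)

def probe(probe_base, cache):
    # flush+reload: after flushing everything, only the secret's line is a "fast" hit
    hits = [i for i in range(256) if probe_base + i * LINE in cache]
    assert len(hits) == 1                 # exactly one line should be hot
    return hits[0]

cache = set()                             # empty set == everything flushed
run_victim_speculatively(secret=0x5A, probe_base=0x100000, cache=cache)
assert probe(0x100000, cache) == 0x5A     # the secret byte is recovered from cache state
```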
For Spectre v2 (where "something" is the branch direction buffer that's used for instructions like jc .somewhere), the attacker needs some way to trick the CPU into putting data into the branch direction buffer (e.g. a load and then a branch that depends on the load, which can be executed speculatively) and some way to extract the information (set the branch direction buffer to a known state beforehand, then use the amount of time that a branch takes to determine how the state of the branch direction buffer changed).

For all of the many possible variations of Spectre, the only important thing (for defense) is what the "something" can be (and how to prevent information from getting into the "something", or flush/overwrite/destroy information that got into the "something"). Everything else (the specific details of one of the many possible implementations of code to attack any one of the many possible Spectre variations) is unimportant.
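The same prime-then-measure pattern with the direction buffer as the "something" can be sketched as a toy model: one boolean per predictor slot, and a misprediction standing in for the extra time a wrongly predicted branch costs. Slot counts and the encoding are invented for illustration:

```python
def prime(buf, n):
    # attacker sets the direction buffer to a known state: all slots predict not-taken
    for i in range(n):
        buf[i] = False

def victim(buf, secret_bit):
    # the victim's branch outcome depends on the secret and retrains one entry
    buf[0] = bool(secret_bit)

def measure(buf):
    # attacker runs a not-taken branch through slot 0: a misprediction (slot
    # trained to taken) costs extra time, revealing that the victim's branch was taken
    return 1 if buf[0] else 0

buf = {}
prime(buf, 4)
victim(buf, secret_bit=1)
assert measure(buf) == 1                  # the secret bit is recovered from predictor state
```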
Vague History Of Spectre
The original Spectre (v1, using cache timing) was found in 2017 and publicly announced in January 2018. It was like a dam bursting, and a few other variants (e.g. v2, using branch prediction) quickly followed. These early variations grabbed a lot of publicity. In the ~6 months or so after that, multiple other variants were found, but they didn't get as much publicity and a lot of people weren't (and still aren't) aware of them. By the "latter half" of 2018, people (e.g. me) started losing track of which variants were proven (via "proof of concept" implementations) and which were still unproven, and some researchers started trying to enumerate the possibilities and establish naming conventions for them. The best example of this that I've seen so far is "A Systematic Evaluation of Transient Execution Attacks and Defenses" (see https://arxiv.org/pdf/1811.05441.pdf ).
However, the "hole in the dam wall" isn't something that can be plugged easily, and (as a rough guess) I think it's going to take several years before we can assume all possibilities have been explored (and I think the need for mitigation will never go away).