How can I create a spectre gadget in practice?

I'm developing (NASM + GCC targetting ELF64) a PoC that uses a spectre gadget that measures the time to access a set of cache lines (FLUSH+RELOAD).

How can I make a reliable spectre gadget?

I believe I understand the theory behind the FLUSH+RELOAD technique, however in practice, despiste some noise, I'm unable to produce a working PoC.

Since I'm using the Timestamp counter and the loads are very regular I use this script to disable the prefetchers, the turbo boost and to fix/stabilize the CPU frequency:

#!/bin/bash

sudo modprobe msr

#Disable turbo
sudo wrmsr -a 0x1a0 0x4000850089

#Disable prefetchers
sudo wrmsr -a 0x1a4 0xf

#Set performance governor
sudo cpupower frequency-set -g performance

#Minimum freq
sudo cpupower frequency-set -d 2.2GHz

#Maximum freq
sudo cpupower frequency-set -u 2.2GHz

I have a continuous buffer, aligned on 4KiB, large enough to span 256 cache lines separated by an integral number GAP of lines.

SECTION .bss ALIGN=4096

 buffer:    resb 256 * (1 + GAP) * 64

I use this function to flush the 256 lines.

flush_all:
 lea rdi, [buffer]              ;Start pointer
 mov esi, 256                   ;How many lines to flush

.flush_loop:
  lfence                        ;Prevent the previous clflush to be reordered after the load
  mov eax, [rdi]                ;Touch the page
  lfence                        ;Prevent the current clflush to be reordered before the load

  clflush  [rdi]                ;Flush a line
  add rdi, (1 + GAP)*64         ;Move to the next line

  dec esi
 jnz .flush_loop                ;Repeat

 lfence                         ;clflush are ordered with respect of fences ..
                                ;.. and lfence is ordered (locally) with respect of all instructions
 ret

The function loops through all the lines, touching every page in between (each page more than once) and flushing each line.

Then I use this function to profile the accesses.

profile:
 lea rdi, [buffer]           ;Pointer to the buffer
 mov esi, 256                ;How many lines to test
 lea r8, [timings_data]      ;Pointer to timings results

 mfence                      ;I'm pretty sure this is useless, but I included it to rule out ..
                             ;.. silly, hard to debug, scenarios

.profile: 
  mfence
  rdtscp
  lfence                     ;Read the TSC in-order (ignoring stores global visibility)

  mov ebp, eax               ;Read the low DWORD only (this is a short delay)

  ;PERFORM THE LOADING
  mov eax, DWORD [rdi]

  rdtscp
  lfence                     ;Again, read the TSC in-order

  sub eax, ebp               ;Compute the delta

  mov DWORD [r8], eax        ;Save it

  ;Advance the loop

  add r8, 4                  ;Move the results pointer
  add rdi, (1 + GAP)*64      ;Move to the next line

  dec esi                    ;Advance the loop
 jnz .profile

 ret

An MCVE is given in appendix and a repository is available to clone.

When assembled with GAP set to 0, linked and executed with taskset -c 0 the cycles necessary to fetch each line are shown below.

Only 64 lines are loaded from memory.

The output is stable across different runs. If I set GAP to 1 only 32 lines are fetched from memory, ofcourse 64 * (1+0) * 64 = 32 * (1+1) * 64 = 4096, so this may be related to paging?

If a store is executed before the profiling (but after the flush) to one of the first 64 lines, the output changes to this

Any store the the other lines gives the first type of output.

I suspect the math in the is broken but I need another couple of eyes find out where.

EDIT

Hadi Brais pointed out a misuse of a volatile register, after fixing that the output is now inconsistent.
I see prevalently runs where the timings are low (~50 cycles) and sometimes runs where the timing are higher (~130 cycles).
I don't know where the 130 cycles figure come from (too low for memory, too high for the cache?).

Code is fixed in the MCVE (and the repository).

If a store to any of the first lines is executed before the profiling, no change is reflected in the output.

APPENDIX - MCVE

BITS 64
DEFAULT REL

GLOBAL main

EXTERN printf
EXTERN exit

;Space between lines in the buffer
%define GAP 0

SECTION .bss ALIGN=4096



 buffer:    resb 256 * (1 + GAP) * 64   


SECTION .data

 timings_data:  TIMES 256 dd 0


 strNewLine db `\n0x%02x: `, 0
 strHalfLine    db "  ", 0
 strTiming  db `\e[48;5;16`,
  .importance   db "0",
        db `m\e[38;5;15m%03u\e[0m `, 0  

 strEnd     db `\n\n`, 0

SECTION .text

;'._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .' 
;   '     '     '     '     '     '     '     '     '     '     '   
; _' \  _' \  _' \  _' \  _' \  _' \  _' \  _' \  _' \  _' \  _' \ 
;/    \/    \/    \/    \/    \/    \/    \/    \/    \/    \/    \
;
;
;FLUSH ALL THE LINES OF A BUFFER FROM THE CACHES
;
;

flush_all:
 lea rdi, [buffer]  ;Start pointer
 mov esi, 256       ;How many lines to flush

.flush_loop:
  lfence        ;Prevent the previous clflush to be reordered after the load
  mov eax, [rdi]    ;Touch the page
  lfence        ;Prevent the current clflush to be reordered before the load

  clflush  [rdi]    ;Flush a line
  add rdi, (1 + GAP)*64 ;Move to the next line

  dec esi
 jnz .flush_loop    ;Repeat

 lfence         ;clflush are ordered with respect of fences ..
            ;.. and lfence is ordered (locally) with respect of all instructions
 ret


;'._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .' 
;   '     '     '     '     '     '     '     '     '     '     '   
; _' \  _' \  _' \  _' \  _' \  _' \  _' \  _' \  _' \  _' \  _' \ 
;/    \/    \/    \/    \/    \/    \/    \/    \/    \/    \/    \
;
;
;PROFILE THE ACCESS TO EVERY LINE OF THE BUFFER
;
;


profile:
 lea rdi, [buffer]      ;Pointer to the buffer
 mov esi, 256           ;How many lines to test
 lea r8, [timings_data]     ;Pointer to timings results


 mfence             ;I'm pretty sure this is useless, but I included it to rule out ..
                ;.. silly, hard to debug, scenarios

.profile: 
  mfence
  rdtscp
  lfence            ;Read the TSC in-order (ignoring stores global visibility)

  mov ebp, eax          ;Read the low DWORD only (this is a short delay)

  ;PERFORM THE LOADING
  mov eax, DWORD [rdi]

  rdtscp
  lfence            ;Again, read the TSC in-order

  sub eax, ebp          ;Compute the delta

  mov DWORD [r8], eax       ;Save it

  ;Advance the loop

  add r8, 4         ;Move the results pointer
  add rdi, (1 + GAP)*64     ;Move to the next line

  dec esi           ;Advance the loop
 jnz .profile

 ret

;'._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .' 
;   '     '     '     '     '     '     '     '     '     '     '   
; _' \  _' \  _' \  _' \  _' \  _' \  _' \  _' \  _' \  _' \  _' \ 
;/    \/    \/    \/    \/    \/    \/    \/    \/    \/    \/    \
;
;
;SHOW THE RESULTS
;
;

show_results:
 lea rbx, [timings_data]    ;Pointer to the timings
 xor r12, r12           ;Counter (up to 256)

.print_line:

 ;Format the output

 xor eax, eax
 mov esi, r12d
 lea rdi, [strNewLine]      ;Setup for a call to printf

 test r12d, 0fh
 jz .print          ;Test if counter is a multiple of 16

 lea rdi, [strHalfLine]     ;Setup for a call to printf

 test r12d, 07h         ;Test if counter is a multiple of 8
 jz .print

.print_timing:

  ;Print
  mov esi, DWORD [rbx]      ;Timing value

  ;Compute the color
  mov r10d, 60          ;Used to compute the color 
  mov eax, esi
  xor edx, edx
  div r10d          ;eax = Timing value / 78

  ;Update the color 


  add al, '0'
  mov edx, '5'
  cmp eax, edx
  cmova eax, edx
  mov BYTE [strTiming.importance], al

  xor eax, eax
  lea rdi, [strTiming]
  call printf WRT ..plt     ;Print a 3-digits number

  ;Advance the loop 

  inc r12d          ;Increment the counter
  add rbx, 4            ;Move to the next timing
  cmp r12d, 256
 jb .print_line         ;Advance the loop

  xor eax, eax
  lea rdi, [strEnd]
  call printf WRT ..plt     ;Print a new line

  ret

.print:

  call printf WRT ..plt     ;Print a string

jmp .print_timing

;'._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .' 
;   '     '     '     '     '     '     '     '     '     '     '   
; _' \  _' \  _' \  _' \  _' \  _' \  _' \  _' \  _' \  _' \  _' \ 
;/    \/    \/    \/    \/    \/    \/    \/    \/    \/    \/    \
;
;
;E N T R Y   P O I N T
;
;
;'._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .' 
;   '     '     '     '     '     '     '     '     '     '     '   
; _' \  _' \  _' \  _' \  _' \  _' \  _' \  _' \  _' \  _' \  _' \ 
;/    \/    \/    \/    \/    \/    \/    \/    \/    \/    \/    \

main:

 ;Flush all the lines of the buffer
 call flush_all

 ;Test the access times
 call profile

 ;Show the results
 call show_results

 ;Exit
 xor edi, edi
 call exit WRT ..plt

I was able to eliminate the false positive L3 cache hits by doing the following:

Use a relatively large GAP. If you want all accesses to miss the L3, GAP should be 63.
Remove the call to flush_all.

There is a data prefetcher other than the ones that can be disabled using sudo wrmsr -a 0x1a4 0xf¹. My understanding of how it works based on a relatively small number of experiments is as follows. It monitors the access pattern to the L3. If it detects that the L3 is being accessed in a manner that is similar (or perhaps identical) to a previously observed access pattern in the recent history, it aggressively prefetches cache lines guided by that previous access pattern (this explains why the call to flush_all should be removed irrespective of GAP). Otherwise, if there are multiple accesses to the L3 that are to the same page, it will prefetch cache lines from the next page based on how the current page is being accessed (so GAP should be large). I think this prefetcher exists in Haswell and later (the original paper that proposed the FLUSH+RELOAD attack used Ivy Bridge). So without these two changes, most accesses will hit in the L3, resulting in many false positives.

The MMU caches can have a significant impact on the measured latency. For example, by removing the call to flush_all and setting GAP to 0, the measured latency of the first access to every page is very large (over 2000 TSC cycles), indicating that the CPU experienced misses in many/all MMU caches². By also setting GAP to 63, all accesses will exhibit such extremely high latencies. If you want all accesses to hit the TLB but still miss the L3, flush_all can be used to populate the TLB, but then something must be done to make the L3 prefetcher forget about the pattern that it has observed without polluting the TLB. I've not made any effort to do that.

Also, there might be no need to fix the frequency or disable any of the prefetchers that can be disabled because the L3 miss latency can be easily distinguished from other latencies.

(1) The L2 streaming prefetcher on Haswell sometimes prefetch lines into the L3 instead of the L2. In particular, lines that are farther in the prefetch target lines are prefetched into the L3 (and then later into the L2) while lines that are closer to the line being currently accessed (for the stream being tracked) are prefetched into the L2. There might be no dedicated L3 prefetcher. The effect of sudo wrmsr -a 0x1a4 0xf on Haswell could be just disabling the L1 prefetchers completely and disabling the L2 prefetchers from prefetching lines into the L2, but perhaps they can still prefetch lines into the L3. At least on Haswell, I think it's more likely that this is what's really happening. On Ivy Bridge (the microarchitecture used by the authors of FLUSH+RELOAD), the effect of sudo wrmsr -a 0x1a4 0xf might be completely disabling all prefetchers. It could also be that the L2 streaming prefetcher on Ivy Bridge is less aggressive than on Haswell and so its effects were not very observable. It'd be interesting to know if there is an Intel patent on dedicated L3 hardware prefetching.

(2) This is most likely the latency of a soft page fault rather than a successful page walk.