Let's say I have a function that I plan to execute as part of a benchmark. I want to bring this code into the L1 instruction cache prior to executing it, since I don't want to measure the cost of I$ misses as part of the benchmark.
The obvious way to do this is to simply execute the code at least once before the benchmark, hence "warming it up" and bringing it into the L1 instruction cache and possibly the uop cache, etc.
What are my alternatives in the case I don't want to execute the code (e.g., because I want the various predictors which key off of instruction addresses to be cold)?
One approach that could work for small functions would be to execute some code which appears on the same cache line(s) as your target function, which will bring in the entire cache line.
For example, you could organize your code roughly as follows (a NASM-style sketch; the label names and exact layout are just for illustration):
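```nasm
align 64
dummy:                       ; starts at a 64-byte boundary
    ret                      ; occupies the first byte of the cache line
function_under_test:         ; starts 1 byte into the same cache line
    ; ... up to 63 bytes of code under test ...
    ret
```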
and then call the `dummy` function prior to calling `function_under_test` - if `dummy` starts on the same cache line as the target function, it would bring the entire cache line into L1I. This works for functions of 63 bytes or less¹.

This can probably be extended to functions of up to ~126 bytes or so by using this trick both before² and after the target function. You could extend it to arbitrarily sized functions by inserting dummy functions on every cache line and having the target code jump over them, but that comes at the cost of inserting otherwise-unnecessary jumps into your code under test, and it requires careful control over the code size so that the dummy functions are placed correctly.
You need fine control over function alignment and placement to achieve this: assembler is probably the easiest, but you can also probably do it with C or C++ in combination with compiler-specific attributes.
¹ You could even reuse the `ret` in the `function_under_test` itself to support slightly longer functions (e.g., those whose `ret` starts within 64 bytes of the start).

² You'd have to be more careful about the dummy function appearing before the code under test: the processor might fetch instructions past the `ret`, and it might (?) even execute them. A `ud2` after the dummy `ret` is likely to block further fetch (but you might want fetch if populating the uop cache is important).

Caveat: some of this answer is Intel-centric. If I just say "the uop cache", I'm talking about Sandybridge-family. I know Ryzen has a uop cache too, but I haven't read much of anything about its internals, and only know some basics about Ryzen from reading Agner Fog's microarch guide.
You can at least prefetch into L2 with software prefetch, but that doesn't necessarily even get you iTLB hits. (The 2nd-level TLB is a victim cache, I think, so a dTLB miss might not populate anything that the iTLB checks.)
But this doesn't help with L1I$ misses, or getting the target code decoded into the uop cache.
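For reference, the software-prefetch version is just a data prefetch on the function's address, something like this sketch (`prefetcht1` requests the line into L2; the label is whatever your function is called):

```nasm
    lea        rax, [rel function_under_test]
    prefetcht1 [rax]            ; request the function's first cache line into L2
    prefetcht1 [rax + 64]       ; and the next line, if the function is longer
```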
If there is a way to do what you want, it's going to be with some kind of trick. x86 has no "standard" way to do this; there is no code-prefetch instruction. Darek Mihocka wrote about code prefetch as part of a CPU-emulator interpreter loop ("The Common CPU Interpreter Loop Revisited: the Nostradamus Distributor") back in 2008, when P4 and Core2 were the hot CPU microarchitectures to tune for.
But of course this is a different case: the goal is sustained performance of indirect branches, not priming things for a benchmark. It doesn't matter if you spend a lot of time achieving the microarchitectural state you want outside the timed portion of a micro-benchmark.
Speaking of which, modern branch predictors aren't just "cold", they always contain some prediction based on aliasing¹. This may or may not be important.
**Prefetch the first / last lines (and maybe others) with `call` to a `ret`**
I think instruction fetch / prefetch normally continues past an ordinary `ret` or `jmp`, because it can't be detected until decode. So you could just call a function that ends at the end of the previous cache line. (Make sure they're in the same page, so an iTLB miss doesn't block prefetch.)

A `ret` after a `call` will predict reliably if no other `call` did anything to the return-address predictor stack, except in rare cases: if an interrupt happened between the call and the ret and the kernel code had a deep enough call tree to push the prediction for this `ret` out of the RSB (return-stack buffer), or if Spectre mitigation flushed it intentionally on a context switch.
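A sketch of that layout (label names and padding are illustrative):

```nasm
align 64
    times 63 nop              ; pad so the ret below is the last byte of this line
line_before:
    ret                       ; call line_before during warm-up: fetching this line
                              ; should also prefetch the next one (same page!)
function_under_test:          ; starts at the next 64-byte boundary
    ; ... code under test ...
    ret
```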
We can even extend this technique to more than 2 cache lines by `call`ing to any `0xC3` (`ret`) byte inside the body of `function_under_test`. But as @BeeOnRope points out, that's dangerous because it may prime branch prediction with "there's a `ret` here", causing a mispredict you otherwise wouldn't have had when calling `function_under_test` for real.

Early in the front-end, branch prediction is needed based on fetch-block address (which block to fetch after this one), not on individual branches inside each block, so this could be a problem even if the `ret` byte was part of another instruction.

But if this idea is viable, then you can look for a `0xc3` byte as part of an instruction in the cache line, or at worst add a 3-byte `nop r/m32` (`0f 1f c3 nop ebx`). `c3` as a ModR/M byte encodes a reg,reg instruction (with ebx and eax as operands), so it doesn't have to be hidden in a disp8 to avoid making the NOP even longer, and it's easy to find in short instructions: e.g. `89 c3 mov ebx,eax`, or use the other opcode so the same ModR/M byte gives you `mov eax,ebx`. Or `83 c3 01 add ebx,0x1`, or many other instructions with e/rbx or bl (and r/eax or al).

With a REX prefix, you have a choice of rbx / r11 (and rax / r8 for the /r field, if applicable). It's likely you can choose (or modify for this microbenchmark) your register allocation to produce an instruction using the relevant registers to produce a `c3` byte without any overhead at all, especially if you can use a custom calling convention (at least for testing purposes) so you can clobber `rbx` if you weren't already saving/restoring it.

I found these by searching for (space) `c3` (space) in the output of `objdump -d /bin/bash`, just to pick a random not-small executable full of compiler-generated code.
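For example (a sketch with made-up labels, forcing the encoding with `db` so the `c3` byte is guaranteed; remember the caveat above about priming a bogus `ret` prediction):

```nasm
function_under_test:
    ; ... earlier code ...
has_c3:
    db 0x89, 0xc3             ; mov ebx,eax: its ModRM byte is 0xc3, the encoding of ret
    ; ... rest of the function ...
    ret

; warm-up code, elsewhere:
    call has_c3 + 1           ; land on the 0xc3 byte: it executes as a ret, after
                              ; that cache line has been fetched into L1I
```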
**Evil hack: end the cache line before with the start of a multi-byte instruction.**
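The layout would be something like this sketch (the `evil_jmp` label is mine): the last byte of the line before the function is `0xE9`, the opcode of `jmp rel32`, so its 4-byte displacement is taken from the first 4 bytes of `function_under_test`.

```nasm
align 64
    times 63 nop              ; pad so the 0xE9 is the last byte of this cache line
evil_jmp:
    db 0xE9                   ; opcode of jmp rel32; its rel32 displacement comes from
                              ; the first 4 bytes of function_under_test
function_under_test:          ; starts at the next 64-byte boundary
    ; ... code under test ...
```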
So it jumps to a virtual address which depends on the instruction bytes of `function_under_test`. Map that page and put code in it that jumps back to your benchmark-prep code. The destination has to be within ±2GiB, so (in 64-bit code) it's always possible to choose a virtual address for `function_under_test` that makes the destination a valid user-space virtual address. Actually, for many `rel32` values, it's possible to choose the address of `function_under_test` to keep both it and the target within the low 2GiB of virtual address space (and definitely within 3GiB), and thus valid 32-bit user-space addresses even under a 32-bit kernel.

This whole idea is probably unnecessary, because I think the above method can reliably prefetch the first line of a function anyway, and this only works for the first line. (Unless you put a `0xe9` in the last byte of each relevant cache line, e.g. as part of a NOP if necessary.)

But this does give you a way to non-speculatively do code-fetch from the cache line you want without architecturally executing any instructions in it. Hopefully a direct `jmp` is detected before any later instructions execute, and probably without any others even decoding, except ones that decoded in the same block. (And an unconditional `jmp` always ends a uop-cache line on Intel.) i.e. the mispredict penalty is all in the front-end, from re-steering the fetch pipeline as soon as decode detects the `jmp`. I hope `ret` is like this too, in cases where the return-predictor stack is not empty.

A `jmp r/m64` would let you control the destination just by putting the address in the right register or memory location. (Figure out what register or memory addressing mode the first byte(s) of `function_under_test` encode, and put an address there.) The opcode is `FF /4`, so you can only use a register addressing mode if the first byte works as a ModRM that has `/r = 4` and mode = 11b. But you could put the first 2 bytes of the `jmp r/m64` in the previous line, so the extra bytes form the SIB (and disp8 or disp32). Whatever those are, you can set up register contents such that the jump-target address will be loaded from somewhere convenient.

But the key problem with a `jmp r/m64` is that default prediction for an indirect branch can fall through and speculatively execute `function_under_test`, affecting the branch-prediction entries for those branches. You could have bogus values in registers so you prime branch prediction incorrectly, but that's still different from not touching them at all.

How does this overlapping-instructions hack to consume bytes from the target cache line affect the uop cache? I think Intel's uop cache puts instructions in the uop-cache line that corresponds to their start address, in cases where they cross a 32- or 64-byte boundary. So when the real execution of `function_under_test` begins, it will simply miss in the uop cache, because no uop-cache line is caching the instruction-start-address range that includes the first byte of `function_under_test`. i.e. the overlapping decode is probably not even noticed when it's split across an L1I$ boundary this way. (I haven't tested my mental model of the uop cache for this, though. I'm mostly assuming that lines record which range of start-addresses they cache, and not the whole range of x86 instruction bytes they're caching.)

**Create mis-speculation to fetch arbitrary lines, but block exec with `lfence`**
Spectre / Meltdown exploits and mitigation strategies provide some interesting ideas: you could maybe trigger a mispredict that fetches at least the start of the code you want, but maybe doesn't speculate into it.
`lfence` blocks speculative execution, but (AFAIK) not instruction prefetch / fetch / decode.

I think (and hope) the front-end will follow direct relative jumps on its own, even after `lfence`, so we can use `jmp target_cache_line` in the shadow of a mispredict + `lfence` to fetch and decode but not execute the target function.

If `lfence` works by blocking the issue stage until the reservation station (OoO scheduler) is empty, then an Intel CPU should probably decode past `lfence` until the IDQ is full (64 uops on Skylake). There are further buffers before other stages (fetch -> instruction-length decode, and between that and decode), so fetch can run ahead of that. Presumably there's a HW prefetcher that runs ahead of where actual fetch is reading from, so it's pretty plausible to get several cache lines into the target function in the shadow of a single mispredict, especially if you introduce delays before the mispredict can be detected.

We can use the same return-address frobbing as a retpoline to reliably trigger a mispredict to a `jmp rel32` which sends fetch into the target function. (I'm pretty sure a re-steer of the front-end can happen in the shadow of speculative execution without waiting to confirm correct speculation, because that would make every re-steer serializing.)
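Concretely, something like this sketch (the `PREFETCH_LINE` macro, the xmm-based address passing, and the exact body of `ret_frob` are my own guesses at the structure described below):

```nasm
%macro PREFETCH_LINE 1          ; one block per target cache line
    lea    rax, [rel %%resume]  ; where the ret should really return to
    movq   xmm1, rax
    xorps  xmm0, xmm0
    call   ret_frob
    lfence                      ; fetched behind the mispredicted ret, never executed
    jmp    %1                   ; direct jmp: the front-end follows it into the target line
%%resume:
%endmacro

align 64
ret_frob:
    lfence                       ; possibly unnecessary; see below
    times 10 sqrtpd xmm0, xmm0   ; long dependency chain: sqrt of 0.0, ten times
    orps   xmm0, xmm1            ; the real return address now depends on that chain
    movq   rax, xmm0
    mov    [rsp], rax            ; overwrite the return address on the stack
    ret                          ; mispredicted: the RSB still points just after the call
```

You'd instantiate `PREFETCH_LINE some_line` once for each cache line you want fetched.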
I'm not sure the `lfence` in `ret_frob` is needed. But it does make it easier to reason about what the front-end is doing relative to the back-end. After the `lfence`, the return address has a data dependency on the chain of 10x `sqrtpd` (10x 15-to-16-cycle latency on Skylake, for example). But the 10x `sqrtpd` + `orps` + `movq` only take 3 cycles to issue (on 4-wide CPUs), leaving at least 148 cycles + store-forwarding latency before `ret` can read the return address back from the stack and discover that the return-stack prediction was wrong.

This should be plenty of time for the front-end to follow the `jmp some_line` and load that line into L1I$, and probably load several lines after that. It should also get some of them decoded into the uop cache.
You need a separate `call`/`lfence`/`jmp` block for each target line (because the target address has to be hard-coded into a direct jump for the front-end to follow it without the back-end executing anything), but they can all share the same `ret_frob` block.

If you left out the `lfence`, you could use the above retpoline-like technique to trigger speculative execution into the function. This would let you jump to any target branch in the target function with whatever args you like in registers, so you can mis-prime branch prediction however you like.

**Footnote 1:**
Modern branch predictors aren't just "cold", they contain predictions from whatever aliased the target virtual addresses in the various branch-prediction data structures. (At least on Intel where SnB-family pretty definitely uses TAGE prediction.)
So you should decide whether you want to specifically anti-prime the branch predictors by (speculatively) executing the branches in your function with bogus data in registers / flags, or whether your micro-benchmarking environment resembles the surrounding conditions of the real program closely enough.
If your target function has enough branching in a very specific complex pattern (like a branchy sort function over 10 integers), then presumably only that exact input can train the branch predictor well, so any initial state other than a specially-warmed-up state is probably fine.
You may not want the uop cache primed at all, to look for cold-execution effects in general (like decode), so that might rule out any speculative fetch / decode, not just speculative execution. Or maybe speculative decode is ok if you then run some uop-cache-polluting long NOPs or `times 800 xor eax,eax` (2-byte instructions -> 16 per 32-byte block, which uses up all 3 entries that SnB-family uop caches allow for a block, without being so dense that the block can't fit in the uop cache at all). But not so many that you evict L1I$ as well.

Even speculative decode without execute will prime the front-end branch prediction that knows where branches are ahead of decode, though. I think that a `ret` (or `jmp rel32`) at the end of the previous cache line

**Map the same physical page to two different virtual addresses.**
L1I$ is physically addressed. (VIPT but with all the index bits from below the page offset, so effectively PIPT).
Branch-prediction and uop caches are virtually addressed, so with the right choice of virtual addresses, a warm-up run of the function at the alternate virtual address will prime L1I, but not branch prediction or uop caches. (This only works if branch aliasing happens modulo something larger than 4096 bytes, because the position within the page is the same for both mappings.)
Prime the iTLB by `call`ing to a `ret` in the same page as the test function, but outside it. After setting this up, no modification of the page tables is required between the warm-up run and the timing run. This is why you use two mappings of the same page instead of remapping a single mapping.
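A sketch of that last part (the label and alignment are illustrative):

```nasm
; somewhere in the same 4 KiB page as function_under_test, but not in one of its lines
align 64
itlb_prime:
    ret                       ; call itlb_prime during warm-up to populate the iTLB
                              ; entry for the page without warming the function itself
```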
Margaret Bloom suggests that CPUs vulnerable to Meltdown might speculatively fetch instructions from a no-exec page if you jump there (in the shadow of a mispredict so it doesn't actually fault), but that would then require changing the page table, and thus a system call which is expensive and might evict that line of L1I. But if it doesn't pollute the iTLB, you could then re-populate the iTLB entry with a mispredicted branch anywhere into the same page as the function. Or just a call to a dummy ret outside the function in the same page.
None of this will let you get the uop cache warmed up, though, because it's virtually addressed. OTOH, in real life, if branch predictors are cold then probably the uop cache will also be cold.