I'm trying to speed up a variable-bitwidth integer compression scheme and I'm interested in generating and executing assembly code on-the-fly. Currently a lot of time is spent on mispredicted indirect branches, and generating code based on the series of bitwidths as found seems to be the only way to avoid this penalty.
The general technique is referred to as "subroutine threading" (or "call threading", although this has other definitions as well). The goal is to take advantage of the processor's efficient call/ret prediction so as to avoid stalls. The approach is well described here: http://webdocs.cs.ualberta.ca/~amaral/cascon/CDP05/slides/CDP05-berndl.pdf
The generated code will be simply a series of calls followed by a return. If there were 5 'chunks' of widths [4,8,8,4,16], it would look like:
call $decode_4
call $decode_8
call $decode_8
call $decode_4
call $decode_16
ret
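For concreteness, here is a minimal sketch of how such a sequence could be emitted into an executable buffer and invoked - purely illustrative, assuming POSIX mmap, the 5-byte call rel32 / 1-byte ret encodings, and decode routines that sit within +-2GB of the buffer (W+X mappings may also be restricted on hardened systems):

    #include <stdint.h>
    #include <string.h>
    #include <sys/mman.h>

    /* Stand-ins for the real decode routines. */
    static void decode_4(void)  {}
    static void decode_8(void)  {}
    static void decode_16(void) {}

    typedef void (*decode_fn)(void);

    /* Emit one "call rel32" (0xE8 xx xx xx xx) per chunk, then a "ret" (0xC3). */
    static decode_fn emit_sequence(const decode_fn *targets, size_t n)
    {
        size_t   len = n * 5 + 1;
        uint8_t *buf = mmap(NULL, len, PROT_READ | PROT_WRITE | PROT_EXEC,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
            return NULL;

        uint8_t *p = buf;
        for (size_t i = 0; i < n; i++) {
            /* rel32 is measured from the end of the call (p + 5); this only
               works if the target is within +-2GB of the buffer. */
            int32_t rel = (int32_t)((uintptr_t)targets[i] - (uintptr_t)(p + 5));
            *p++ = 0xE8;
            memcpy(p, &rel, 4);
            p += 4;
        }
        *p = 0xC3;                         /* final ret */
        return (decode_fn)buf;
    }

    int main(void)
    {
        const decode_fn widths[] = { decode_4, decode_8, decode_8,
                                     decode_4, decode_16 };
        decode_fn generated = emit_sequence(widths, 5);
        if (generated)
            generated();                   /* call x5, then ret */
        return 0;
    }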
In actual use, it will be a longer series of calls, with a sufficient length that each series will likely be unique and only called once. Generating and calling the code is well documented, both here and elsewhere. But I haven't found much discussion of efficiency beyond a simple "don't do it" or a well-considered "there be dragons". Even the Intel documentation speaks mostly in generalities:
8.1.3 Handling Self- and Cross-Modifying Code
The act of a processor writing data into a currently executing code segment with the intent of executing that data as code is called self-modifying code. IA-32 processors exhibit model-specific behavior when executing self modified code, depending upon how far ahead of the current execution pointer the code has been modified. ... Self-modifying code will execute at a lower level of performance than non-self-modifying or normal code. The degree of the performance deterioration will depend upon the frequency of modification and specific characteristics of the code.
11.6 SELF-MODIFYING CODE
A write to a memory location in a code segment that is currently cached in the processor causes the associated cache line (or lines) to be invalidated. This check is based on the physical address of the instruction. In addition, the P6 family and Pentium processors check whether a write to a code segment may modify an instruction that has been prefetched for execution. If the write affects a prefetched instruction, the prefetch queue is invalidated. This latter check is based on the linear address of the instruction. For the Pentium 4 and Intel Xeon processors, a write or a snoop of an instruction in a code segment, where the target instruction is already decoded and resident in the trace cache, invalidates the entire trace cache. The latter behavior means that programs that self-modify code can cause severe degradation of performance when run on the Pentium 4 and Intel Xeon processors.
While there is a performance counter to determine whether bad things are happening (C3 04 MACHINE_CLEARS.SMC: Number of self-modifying-code machine clears detected) I'd like to know more specifics, particularly for Haswell. My impression is that as long as I can write the generated code far enough ahead of time that the instruction prefetch has not gotten there yet, and as long as I don't trigger the SMC detector by modifying code on the same page (quarter-page?) as anything currently being executed, then I should get good performance. But all the details seem extremely vague: how close is too close? How far is far enough?
Trying to make these into specific questions:
What is the maximum distance ahead of the current instruction that the Haswell prefetcher ever runs?
What is the maximum distance behind the current instruction that the Haswell 'trace cache' might contain?
What is the actual penalty in cycles for a MACHINE_CLEARS.SMC event on Haswell?
How can I run the generate/execute cycle in a predicted loop while preventing the prefetcher from eating its own tail?
How can I arrange the flow so that each piece of generated code is always "seen for the first time" and not stepping on instructions already cached?
This doesn't have to be self-modifying code at all - it can be dynamically created code instead, i.e. runtime-generated "trampolines".
Meaning you keep a (global) function pointer around that'll redirect to a writable/executable mapped section of memory - in which you then actively insert the function calls you wish to make.
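A minimal sketch of that indirection, with hypothetical names:

    /* Hypothetical names: decode_sequence points into the writable+executable
       region holding the most recently generated series of calls. */
    typedef void (*decode_sequence_fn)(void);

    decode_sequence_fn decode_sequence;    /* set after each code generation */

    static void run_current_block(void)
    {
        decode_sequence();                 /* one indirect call into the generated code */
    }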
The main difficulty with this is that call is IP-relative (as are most forms of jmp), so you'll have to calculate the offset between the memory location of your trampoline and the "target funcs". That as such is simple enough - but combine it with 64-bit code and you run into the restriction that call can only encode relative displacements in the range of +-2GB, so it becomes more complex: you'd need to call through a linkage table. So you'd essentially create code of that shape - a block of call instructions, each targeting a slot in an adjacent jump table that in turn jumps to the real function (/me severely UN*X biased, hence AT&T assembly thinking and some references to ELF-isms); a concrete layout is sketched below.
This can be created at compile time (just make a writable text section), or dynamically at runtime.
You then, at runtime, patch the jump targets. That's similar to how the .plt ELF section (PLT = procedure linkage table) works - just that there it's the dynamic linker which patches the jmp slots, while in your case you do that yourself. If you go for all-runtime, then a table like the above is easily creatable through C/C++ even; start with data structures like:
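(Sketch only - the exact field layout and the FF 25 00 00 00 00 "jmp *0(%rip)" thunk encoding are illustrative assumptions; calltbl, jmptbl and call_displacement match the names used in the next paragraph.)

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    struct call_entry {                  /* one packed 5-byte "call rel32"      */
        uint8_t  call_opcode;            /* 0xE8                                */
        int32_t  call_displacement;      /* rel32, measured from the next entry */
    } __attribute__((packed));

    struct jmp_entry {                   /* one 16-byte absolute-jump thunk     */
        uint8_t  jmp_opcode[6];          /* FF 25 00 00 00 00 = jmp *0(%rip)    */
        uint64_t target;                 /* 64-bit address of the real function */
        uint8_t  pad[2];
    } __attribute__((packed));

    /* Both tables must live in writable+executable memory (mmap/mprotect);
       the call block is terminated with a 0xC3 ret, as in the question. */
    static void fill_slot(struct call_entry *calltbl, struct jmp_entry *jmptbl,
                          size_t i, void (*decode_fn)(void))
    {
        static const uint8_t jmp_rip0[6] = { 0xFF, 0x25, 0, 0, 0, 0 };

        calltbl[i].call_opcode       = 0xE8;
        calltbl[i].call_displacement =
            (int32_t)((char *)&jmptbl[i] - (char *)&calltbl[i + 1]);

        memcpy(jmptbl[i].jmp_opcode, jmp_rip0, sizeof jmp_rip0);
        jmptbl[i].target = (uint64_t)(uintptr_t)decode_fn;
    }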
The only critical and somewhat system-dependent thing here is the "packed" nature of this that one needs to tell the compiler about (i.e. not to pad the call array out), and that one should cacheline-align the jump table. You need to set calltbl[i].call_displacement = (int32_t)((char *)&jmptbl[i] - (char *)&calltbl[i+1]), initialize the empty/unused jump table with memset(&jmptbl, 0xC3 /* RET */, sizeof(jmptbl)), and then just fill in the fields with the jump opcode and target address as you need.
Very good question, but the answer is not so easy... Probably the final word will come from experiment - a common situation in the modern world of diverse architectures.
Anyway, what you want to do is not exactly self-modifying code. The "decode_x" procedures already exist and will not be modified, so there should be no problem with the cache on that account.
On the other hand, the memory for the generated code will probably be allocated dynamically from the heap, so its addresses will be far from the program's executable code. You can allocate a new block every time you need to generate a new call sequence.
How far is far enough? I think not very far. The distance probably only needs to be a multiple of the processor's cache line size, which is not big - something like 64 bytes for L1. With dynamically allocated memory you will be many pages away anyway.
The main problem with this approach, IMO, is that the generated code will be executed only once. This way the program loses the main advantage of the cached memory model - efficient execution of looping code.
And in the end, the experiment does not look so hard to carry out: just write a test program in both variants and measure the performance. If you publish the results, I will read them carefully. :)
This is less in the scope of SMC and more in that of Dynamic Binary Optimization - i.e. you don't really manipulate the code you're running (as in writing new instructions); you just generate a different piece of code and reroute the appropriate call in your program to jump there instead. The only modification is at the entry point, and it's done only once, so you don't need to worry too much about the overhead (it usually means flushing all the pipelines to make sure the old instruction isn't still alive anywhere in the machine; I'd guess the penalty is a few hundred clock cycles, depending on how loaded the CPU is - only relevant if it's occurring repeatedly).
In the same sense, you shouldn't worry too much about doing this far enough ahead of time. By the way, regarding your question - the CPU can only execute ahead as far as its ROB size, which on Haswell is 192 uops (not instructions, but close enough) according to http://www.realworldtech.com/haswell-cpu/3/ , and it can see slightly further ahead thanks to the predictor and fetch units, so overall we're talking about, let's say, a few hundred instructions.
Having said that, let me reiterate what was said here before - experiment, experiment, experiment :)
I found some better documentation from Intel and this seemed like the best place to put it for future reference:
Intel® 64 and IA-32 Architectures Optimization Reference Manual
It's only a partial answer to the questions (test, test, test), but it gives firmer numbers than the other sources I had found.