I’m currently coding highly optimised versions of some C99 standard library string functions, like strlen(), memset(), etc., using x86-64 assembly with SSE-2 instructions.
So far I’ve managed to get excellent results in terms of performance, but I sometimes get weird behaviour when I try to optimise more.
For instance, adding or even removing some simple instructions, or simply reorganising some local labels used with jumps, completely degrades the overall performance, and there's absolutely no reason for it in the code itself.
So my guess is that there are some issues with code alignment, and/or with branches that get mispredicted.
I know that, even with the same architecture (x86-64), different CPUs have different algorithms for branch prediction.
But is there any general advice, when developing for high performance on x86-64, about code alignment and branch prediction?
In particular, about alignment, should I ensure all labels used by jump instructions are aligned on a DWORD?
_func:
; ... Some code ...
test rax, rax
jz .label
; ... Some code ...
ret
.label:
; ... Some code ...
ret
In the previous code, should I use an align directive before .label, like:
align 4
.label:
If so, is it enough to align on a DWORD when using SSE-2?
And about branch prediction, is there a «preferred» way to organise the labels used by jump instructions, in order to help the CPU, or are today's CPUs smart enough to determine that at runtime, by counting the number of times a branch is taken?
EDIT
Ok, here's a concrete example - here's the start of strlen() with SSE-2:
_strlen64_sse2:
mov rsi, rdi            ; save the original pointer
and rdi, -16            ; align the pointer down to a 16-byte boundary
pxor xmm0, xmm0         ; xmm0 = 16 zero bytes
pcmpeqb xmm0, [ rdi ]   ; compare 16 bytes of the string against zero
pmovmskb rdx, xmm0      ; collect the compare results into a bitmask
; ...
Running it 10'000'000 times with a 1000-character string gives about 0.48 seconds, which is fine.
But it does not check for a NULL string input. So obviously, I'll add a simple check:
_strlen64_sse2:
test rdi, rdi
jz .null
; ...
Same test: it now runs in 0.59 seconds. But if I align the code after this check:
_strlen64_sse2:
test rdi, rdi
jz .null
align 8
; ...
The original performance is back. I used 8 for the alignment, as 4 doesn't change anything.
Can anyone explain this, and give some advice about when to align, or not align, code sections?
EDIT 2
Of course, it's not as simple as aligning every branch target. If I do that, performance usually gets worse, except in specific cases like the one above.
Alignment optimisations
1. Use .p2align <abs-expr> <abs-expr> <abs-expr> instead of align. It grants fine-grained control using its 3 params: the alignment boundary, the fill bytes (e.g. NOPs), and the maximum number of padding bytes to allow (a sketch follows this list).
2. Align the start of a frequently-used code block to a cache-line-size boundary.
3. Use multi-byte NOPs for padding to reduce the time spent executing NOPs (up to 10-byte NOPs for x86. Source: binutils-2.2.3.)
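To illustrate the first point, here's a minimal sketch in GAS syntax (the label is a placeholder), since .p2align is a GAS directive:

.p2align 4, 0x90, 7    # align to 2^4 = 16 bytes, padding with NOPs (0x90),
                       # unless more than 7 bytes of padding would be needed
.Lhot_loop:
    # ... loop body ...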
Branch-prediction optimisations

There is a lot of variation between x86-64 micro-architectures/generations. However, a common set of guidelines that are applicable to all of them can be summarised as follows. Reference: Section 3 of Agner Fog's x86 micro-architecture manual.
1. Stick to short jumps.
2. Unroll large loops (a sketch follows).
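As a sketch of the second point in the question's context, here's a hypothetical 4x-unrolled SSE-2 fill loop (NASM syntax; assumes rdi holds a 16-byte-aligned destination, xmm0 the fill pattern, and rcx a byte count that is a multiple of 64):

.fill_loop:
movdqa [ rdi ], xmm0         ; four 16-byte stores per iteration...
movdqa [ rdi + 16 ], xmm0
movdqa [ rdi + 32 ], xmm0
movdqa [ rdi + 48 ], xmm0
add rdi, 64
sub rcx, 64
jnz .fill_loop               ; ...so the taken branch runs 4x less often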
The "branch targets should be 16 byte aligned rule" is not an absolute. The reason for the rule is that with 16 byte alignment, 16 bytes of instructions can be read in one cycle, and then another 16 bytes in the next cycle. If your target is at offset 16n + 2, then the processor can still read 14 bytes of instructions (the remainder of the cache line) in one cycle, and that is often good enough. Starting a loop at offset 16n + 15 however is a bad idea, since only one instruction byte can be read at a time. More useful is to keep the whole loop in the smallest number of cache lines possible.
On some processors branch prediction has the odd behaviour that all branches within 8 or 4 bytes use the same branch predictor. Move branches so that each conditional branch uses its own branch predictor.
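As a hypothetical sketch of that advice (NASM syntax; whether padding helps, and how much is needed, depends on the actual instruction lengths and the CPU):

test rax, rax
jz .error_a
db 0x0F, 0x1F, 0x40, 0x00    ; 4-byte NOP: pushes the next branch into a
                             ; different 8-byte block, so the two branches
                             ; don't share a predictor entry
test rbx, rbx
jz .error_b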
What both of these have in common is that inserting some bits of code can change the behaviour and make it faster or slower.
To get a better understanding of why and how alignment matters, check out Agner Fog's microarchitecture doc, esp. the section about the instruction-fetch front-end of various CPU designs. Sandybridge introduced the uop cache, which makes a huge difference to throughput, esp. in SSE code where instruction length is often too long for 16B per cycle to cover 4 instructions.
The rules for filling uop cache lines are complicated, but a new block of 32B of instructions always starts a new cache line, IIRC. So aligning hot function entry points to 32B is a good idea. That much padding in other cases might be hurting I$ density more than helping. (L1 I$ still has 64B cache lines, though, so some things might hurt L1 I$ density while helping uop cache density.)
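In NASM terms, that suggestion looks something like this for a hot entry point (a sketch, reusing the function name from the question):

align 32             ; the entry point starts a fresh 32B instruction block
_strlen64_sse2:
; ... Some code ...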
The loop buffer helps too, but taken branches disrupt the 4 uops per cycle. e.g. a loop of 3 uops executes like abc, abc, not abca, bcda. So a 5-uop loop goes at one iteration per 2 cycles, not one per 1.25. This makes unrolling even more valuable.

To extend on TheCodeArtist's answer, who made some good points, here are a few additional details, as I was actually able to solve the problem.
1 - Code alignment
Intel recommends aligning code and branch targets on 16-byte boundaries.
While this is usually good advice, it should be done carefully.
Blindly 16-byte aligning everything may lead to performance loss, so this should be tested on each branch target before being applied.
As TheCodeArtist pointed out, using multi-byte NOPs may help here, as simply using standard one-byte NOPs may not bring the expected performance gain of code alignment.
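For reference, the shortest of the recommended multi-byte NOP encodings from Intel's optimisation manual, written as raw bytes so they work in any assembler:

db 0x90                             ; 1-byte NOP
db 0x66, 0x90                       ; 2-byte NOP (66 NOP)
db 0x0F, 0x1F, 0x00                 ; 3-byte NOP
db 0x0F, 0x1F, 0x40, 0x00           ; 4-byte NOP
db 0x0F, 0x1F, 0x44, 0x00, 0x00     ; 5-byte NOP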
As a side note, the .p2align directive is not available in NASM or YASM, but they do support alignment with instructions other than NOPs via the standard align directive:
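For instance (a sketch; the smartalign package is NASM-specific, so check YASM's docs for its equivalent):

align 16, db 0x90    ; explicit fill: pad with one-byte NOPs

; NASM's smartalign package upgrades align to multi-byte NOP padding:
%use smartalign
alignmode p6         ; pad with long NOPs suited to modern CPUs
align 16
.label:
; ... Some code ...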
2 - Branch prediction

This turned out to be the most important part.
While it's true that every generation of x86-64 CPUs has different branch-prediction algorithms, some simple rules can be applied generally to help the CPU predict which branch will likely be taken.
The CPU tries to keep a branching history in the BTB (Branch Target Buffer).
But when branch information is not available in the BTB, the CPU will use what is called static prediction, which obeys simple rules, as mentioned in Intel's manuals: a forward conditional branch is predicted to be not taken, while a backward conditional branch is predicted to be taken.
Here's an example for the first case:
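A minimal sketch, with placeholder instructions around the branch:

test rax, rax
jz .label       ; forward branch: statically predicted not taken
; ... Likely code ...
ret
.label:
; ... Unlikely code ...
ret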
The instructions under .label are the unlikely condition, because .label is declared after the actual branch.

For the second case:
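Again as a minimal sketch:

.label:
; ... Likely code ...
test rax, rax
jnz .label      ; backward branch: statically predicted taken
; ... code executed once the loop exits ...
ret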
Here, the instructions under .label are the likely condition, as .label is declared before the actual branch.

So each conditional branch should always follow this simple pattern.
And of course, this is also suitable for loops.
As I mentioned before, this was the most important part.
I was experiencing unpredictable performance gains or losses while adding simple tests that should logically have improved the overall performance.
Sticking blindly to these rules solved the issues.
When they're not followed, adding a branch for optimisation purposes may have the opposite result.
TheCodeArtist also mentions loop unrolling in his answer.
While this wasn't the issue here, as my loops were already unrolled, I mention it because it's indeed extremely important, and brings substantial performance gains.
And as a last note for the readers, while this may seem obvious and wasn't the problem here, don't branch when unnecessary.
Starting with the Pentium Pro, x86 processors have conditional move instructions, which may help to eliminate branching and suppress the risk of misprediction:
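For instance, a hypothetical sketch (the registers are placeholders):

; With a branch, which may be mispredicted:
test rax, rax
jz .skip
mov rax, rbx
.skip:
; ...

; Branchless, using a conditional move:
test rax, rax
cmovnz rax, rbx
; ...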
So, just in case, it's a nice thing to keep in mind.