I'm reading around that branch misprediction can be a hot bottleneck for the performance of an application. As I can see, people often show assembly code that unveil the problem and state that programmers usually can predict where a branch could go the most of the times and avoid branch mispredictons.
My questions are:
1- Is it possible to avoid branch mispredictions using some high level programming technique (i.e. no assembly)?
2- What should I keep in mind to produce branch-friendly code in a high level programming language (I'm mostly interested in C and C++)?
Code examples and benchmarks are welcome!
Branchless isn't always better, even if both sides of the branch are trivial. When branch prediction works, it's faster than a loop-carried data dependency.
See gcc optimization flag -O3 makes code slower than -O2 for a case where
gcc -O3
transforms anif()
to branchless code in a case where it's very predictable, making it slower.Sometimes you are confident that a condition is unpredictable (e.g. in a sort algorithm or binary search). Or you care more about the worst-case not being 10x slower than about the fast-case being 1.5x faster.
Some idioms are more likely to compile to a branchless form (like a
cmov
x86 conditional move instruction).The first way always writes to
x
, while the second way doesn't modifyx
in one of the branches. This seems to be the reason that some compilers tend to emit a branch instead of acmov
for theif
version. This applies even whenx
is a localint
variable that's already live in a register, so "writing" it doesn't involve a store to memory, just changing the value in a register.Compilers can still do whatever they want, but I've found this difference in idiom can make a difference. Depending on what you're testing, it's occasionally better to help the compiler mask and AND rather than doing a plain old
cmov
. I did it in that answer because I knew that the compiler would have what it needed to generate the mask with a single instruction (and from seeing how clang did it).TODO: examples on http://gcc.godbolt.org/
(*) Experienced programmers often remind that human programmers are very bad at predicting that.
Not in standard c++ or c. At least not for a single branch. What you can do is minimize the depth of your dependency chains so that branch mis-prediction would not have any effect. Modern cpus will execute both code paths of a branch and drop the one that wasn't chosen. There's a limit to this however, which is why branch prediction only matters in deep dependency chains.
Some compilers provide extension for suggesting the prediction manually such as __builtin_expect in gcc. Here is a stackoverflow question about it. Even better, some compilers (such as gcc) support profiling the code and automatically detect the optimal predictions. It's smart to use profiling rather than manual work because of (*).
Primarily, you should keep in mind that branch mis-prediction is only going to affect you in the most performance critical part of your program and not to worry about it until you've measured and found a problem.
Lundin gave very sensible advice
Order of 2. and 3. may be switched. Optimizing your code by hand is a lot of work. On the other hand, gathering the profiling data can be difficult for some programs as well.
(**) One way to do that is transform your loops by for example unrolling them. You can also let the optimizer do it automatically. You must measure though, because unrolling will affect the way you interact with the cache and may well end up being a pessimization.
In general it's a good idea to keep hot inner loops well proportioned to the cache sizes most commonly encountered. That is, if your program handles data in lumps of, say, less than 32kbytes at a time and does a decent amount of work on it then you're making good use of the L1 cache.
In contrast if your hot inner loop chews through 100MByte of data and performs only one operation on each data item, then the CPU will spend most of the time fetching data from DRAM.
This is important because part of the reason CPUs have branch prediction in the first place is to be able to pre-fetch operands for the next instruction. The performance consequences of a branch mis-prediction can be reduced by arranging your code so that there's a good chance that the next data comes from L1 cache no matter what branch is taken. Whilst not a perfect strategy, L1 cache sizes seem to be universally stuck on 32 or 64K; it's almost a constant thing across the industry. Admittedly coding in this way is not often straightforward, and relying on profile driven optimisation, etc. as recommended by others is probably the most straightforward way ahead.
Regardless of anything else, whether or not a problem with branch mis-prediction will occur varies according to the CPU's cache sizes, what else is running on the machine, what the main memory bandwidth / latency is, etc.
To answer your questions let me explain how does branch prediction works.
First of all, there is a branch penalty when the processor correctly predicts the taken branches. If the processor predicts a branch as taken, then it has to know the target of the predicted branch since execution flow will continue from that address. Assuming that the branch target address is already stored in Branch Target Buffer(BTB), it has to fetch new instructions from the address found in BTB. So you are still wasting a few clock cycles even if the branch is correctly predicted.
Since BTB has an associative cache structure the target address might not be present, and hence more clock cycles might be wasted.
On the other, hand if the CPU predicts a branch as not taken and if it's correct then there is no penalty since the CPU already knows where the consecutive instructions are.
As I explained above, predicted not taken branches have higher throughput than predicted taken branches.
Yes, it is possible. You can avoid by organizing your code in way that all branches have repetitive branch pattern such that always taken or not taken.
But if you want to get higher throughput you should organize branches in a way that they are most likely to be not taken as I explained above.
If it's possible eliminate branches as possible. If this is not the case when writing if-else or switch statements, check the most common cases first to make sure the branches most likely to be not taken. Try to use _
_builtin_expect(condition, 1)
function to force compiler to produce condition to be treated as not taken.Perhaps the most common techniques is to use separate methods for normal and error returns. C has no choice, but C++ has exceptions. Compilers are aware that the exception branches are exceptional and therefore unexpected.
This means that exception branches are indeed slow, as they're unpredicted, but the non-error branch is made faster. On average, this is a net win.
Linux kernel defines
likely
andunlikely
macros based on__builtin_expect
gcc builtins:(See here for the macros definitions in
include/linux/compiler.h
)You can use them like:
or