Branch-aware programming

Posted 2019-03-09 16:01

I keep reading that branch misprediction can be a hot bottleneck for application performance. People often show assembly code that unveils the problem and state that programmers can usually predict where a branch will go most of the time, avoiding branch mispredictions.

My questions are:

1- Is it possible to avoid branch mispredictions using some high level programming technique (i.e. no assembly)?

2- What should I keep in mind to produce branch-friendly code in a high level programming language (I'm mostly interested in C and C++)?

Code examples and benchmarks are welcome!

8 answers
放我归山
Answer 2 · 2019-03-09 16:32

Branchless isn't always better, even if both sides of the branch are trivial. When branch prediction works, it's faster than a loop-carried data dependency.

See gcc optimization flag -O3 makes code slower than -O2 for a case where gcc -O3 transforms an if() into branchless code even though the branch is very predictable, making it slower.

Sometimes you are confident that a condition is unpredictable (e.g. in a sort algorithm or binary search). Or you care more about the worst-case not being 10x slower than about the fast-case being 1.5x faster.


Some idioms are more likely to compile to a branchless form (like a cmov x86 conditional move instruction).

x = x>limit ? limit : x;   // likely to compile branchless

if (x>limit) x=limit;      // less likely to compile branchless, but still can

The first way always writes to x, while the second way doesn't modify x in one of the branches. This seems to be the reason that some compilers tend to emit a branch instead of a cmov for the if version. This applies even when x is a local int variable that's already live in a register, so "writing" it doesn't involve a store to memory, just changing the value in a register.

Compilers can still do whatever they want, but I've found this difference in idiom can make a difference. Depending on what you're testing, it's occasionally better to help the compiler mask and AND rather than doing a plain old cmov. I did it in that answer because I knew that the compiler would have what it needed to generate the mask with a single instruction (and from seeing how clang did it).

TODO: examples on http://gcc.godbolt.org/

贼婆χ
Answer 3 · 2019-03-09 16:41

people often ... and state that programmers usually can predict where a branch could go

(*) Experienced programmers often remind us that human programmers are very bad at predicting that.

1- Is it possible to avoid branch mispredictions using some high level programming technique (i.e. no assembly)?

Not in standard C++ or C. At least not for a single branch. What you can do is minimize the depth of your dependency chains so that a branch misprediction has less effect. Modern CPUs execute speculatively past a branch and throw the work away if the prediction was wrong. There's a limit to how much can be in flight at once, however, which is why branch misprediction mostly matters in deep dependency chains.

Some compilers provide extensions for suggesting the prediction manually, such as __builtin_expect in GCC. Here is a stackoverflow question about it. Even better, some compilers (such as GCC) support profiling the code and automatically detecting the optimal predictions. It's smart to use profiling rather than manual work because of (*).

2- What should I keep in mind to produce branch-friendly code in a high level programming language (I'm mostly interested in C and C++)?

Primarily, keep in mind that branch misprediction is only going to affect you in the most performance-critical parts of your program; don't worry about it until you've measured and found a problem.

But what can I do when some profiler (valgrind, VTune, ...) tells that on line n of foo.cpp I got a branch prediction penalty?

Lundin gave very sensible advice:

  1. Measure to find out whether it matters.
  2. If it matters, then
    • Minimize the depth of dependency chains of your calculations. How to do that can be quite complicated and beyond my expertise and there's not much you can do without diving into assembly. What you can do in a high level language is to minimize the number of conditional checks (**). Otherwise you're at the mercy of compiler optimization. Avoiding deep dependency chains also allows more efficient use of out-of-order superscalar processors.
    • Make your branches consistently predictable. The effect of that can be seen in this stackoverflow question. In the question, there is a loop over an array. The loop contains a branch. The branch depends on the size of the current element. When the data was sorted, the loop could be demonstrated to be much faster when compiled with a particular compiler and run on a particular cpu. Of course, keeping all your data sorted will also cost cpu time, possibly more than the branch mis-predictions do, so, measure.
  3. If it's still a problem, use profile guided optimization (if available).

Order of 2. and 3. may be switched. Optimizing your code by hand is a lot of work. On the other hand, gathering the profiling data can be difficult for some programs as well.

(**) One way to do that is to transform your loops, for example by unrolling them. You can also let the optimizer do it automatically. You must measure though, because unrolling will affect the way you interact with the cache and may well end up being a pessimization.

Root(大扎)
Answer 4 · 2019-03-09 16:50

In general it's a good idea to keep hot inner loops well proportioned to the cache sizes most commonly encountered. That is, if your program handles data in lumps of, say, less than 32 KiB at a time and does a decent amount of work on it, then you're making good use of the L1 cache.

In contrast, if your hot inner loop chews through 100 MB of data and performs only one operation on each data item, then the CPU will spend most of its time fetching data from DRAM.

This is important because part of the reason CPUs have branch prediction in the first place is to be able to pre-fetch operands for the next instruction. The performance consequences of a branch misprediction can be reduced by arranging your code so that there's a good chance the next data comes from L1 cache no matter which branch is taken. While not a perfect strategy, it helps that L1 cache sizes seem universally stuck at 32 or 64 KiB; it's almost a constant across the industry. Admittedly, coding this way is not often straightforward, and relying on profile-driven optimisation etc., as recommended by others, is probably the most straightforward way ahead.

Regardless of anything else, whether or not a problem with branch mis-prediction will occur varies according to the CPU's cache sizes, what else is running on the machine, what the main memory bandwidth / latency is, etc.

【Aperson】
Answer 5 · 2019-03-09 16:52

To answer your questions, let me explain how branch prediction works.

First of all, there is a branch penalty even when the processor correctly predicts taken branches. If the processor predicts a branch as taken, then it has to know the target of the predicted branch, since execution flow will continue from that address. Assuming the branch target address is already stored in the Branch Target Buffer (BTB), it has to fetch new instructions from the address found in the BTB. So you are still wasting a few clock cycles even if the branch is correctly predicted.
Since the BTB is an associative cache, the target address might not be present, in which case more clock cycles are wasted.

On the other hand, if the CPU predicts a branch as not taken and it's correct, then there is no penalty, since the CPU already knows where the consecutive instructions are.

As I explained above, predicted not taken branches have higher throughput than predicted taken branches.

Is it possible to avoid branch misprediction using some high level programming technique (i.e. no assembly)?

Yes, it is possible. You can avoid mispredictions by organizing your code so that all branches follow a repetitive pattern, such as always taken or always not taken.
But if you want to get higher throughput, you should organize branches in a way that they are most likely to be not taken, as I explained above.

What should I keep in mind to produce branch-friendly code in a high level programming language (I'm mostly interested in C and C++)?

Eliminate branches where possible. When that's not possible, while writing if-else or switch statements, check the most common cases first, so that the branches most likely to be executed become the fall-through (not-taken) path. You can also use __builtin_expect(condition, 0) to hint the compiler that a condition is expected to be false, so the corresponding branch is laid out as not taken.

啃猪蹄的小仙女
Answer 6 · 2019-03-09 16:55

Perhaps the most common technique is to use separate paths for normal and error returns. C has no choice, but C++ has exceptions. Compilers are aware that exception branches are exceptional and therefore unexpected.

This means that exception branches are indeed slow, as they're unpredicted, but the non-error branch is made faster. On average, this is a net win.

唯我独甜
Answer 7 · 2019-03-09 16:56

The Linux kernel defines likely and unlikely macros based on the __builtin_expect GCC builtin:

    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

(See here for the macro definitions in include/linux/compiler.h)

You can use them like:

if (likely(a > 42)) {
    /* ... */
} 

or

if (unlikely(ret_value < 0)) {
    /* ... */
}