The latest Intel software developer's manual describes two instruction prefixes:
Group 2 > Branch Hints
0x2E: Branch Not Taken
0x3E: Branch Taken
These allow for explicit branch prediction of jump instructions (opcodes like Jxx).
I remember reading a couple of years ago that on x86 explicit branch prediction was essentially a no-op in the context of gcc's branch prediction intrinsics.
I am now unclear if these x86 branch hints are a new feature or whether they are essentially no-ops in practice.
Can anyone clear this up?
(That is: Do gcc's branch prediction functions generate these x86 branch hints? Do current Intel CPUs honor them rather than ignore them? And if so, since when?)
Update:
I created a quick test program:
int main(int argc, char** argv)
{
    if (__builtin_expect(argc, 0))
        return 1;
    if (__builtin_expect(argc == 2, 1))
        return 2;
    return 3;
}
Disassembles to the following:
00000000004004cc <main>:
4004cc: 55 push %rbp
4004cd: 48 89 e5 mov %rsp,%rbp
4004d0: 89 7d fc mov %edi,-0x4(%rbp)
4004d3: 48 89 75 f0 mov %rsi,-0x10(%rbp)
4004d7: 8b 45 fc mov -0x4(%rbp),%eax
4004da: 48 98 cltq
4004dc: 48 85 c0 test %rax,%rax
4004df: 74 07 je 4004e8 <main+0x1c>
4004e1: b8 01 00 00 00 mov $0x1,%eax
4004e6: eb 1b jmp 400503 <main+0x37>
4004e8: 83 7d fc 02 cmpl $0x2,-0x4(%rbp)
4004ec: 0f 94 c0 sete %al
4004ef: 0f b6 c0 movzbl %al,%eax
4004f2: 48 85 c0 test %rax,%rax
4004f5: 74 07 je 4004fe <main+0x32>
4004f7: b8 02 00 00 00 mov $0x2,%eax
4004fc: eb 05 jmp 400503 <main+0x37>
4004fe: b8 03 00 00 00 mov $0x3,%eax
400503: 5d pop %rbp
400504: c3 retq
400505: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
40050c: 00 00 00
40050f: 90 nop
I don't see 2E or 3E prefixes on any of the branches. Maybe gcc has elided them for some reason?
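One thing that makes scanning a dump by eye error-prone: 0x2E and 0x3E are also the CS and DS segment-override prefixes, and they only count as branch hints when they sit directly in front of a conditional jump. The 2e in the nopw padding line above is such a segment override, not a hint. A rough sketch of a check for the hinted encoding follows. The function name branch_hint_prefix is made up for illustration, and it only looks at a single prefix byte followed by a short (70..7F) or near (0F 80..8F) Jcc opcode, so it is not a real decoder.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of any library.
 * Returns 0x2E or 0x3E if `code` starts with that prefix byte immediately
 * followed by a conditional-jump opcode, and 0 otherwise.
 * Assumption: at most one prefix byte precedes the opcode. */
static int branch_hint_prefix(const unsigned char *code, size_t len)
{
    assert(code != NULL);

    if (len < 2 || (code[0] != 0x2E && code[0] != 0x3E))
        return 0;

    /* Short form: 70..7F rel8 */
    if (code[1] >= 0x70 && code[1] <= 0x7F)
        return code[0];

    /* Near form: 0F 80..8F rel32 */
    if (len >= 3 && code[1] == 0x0F && code[2] >= 0x80 && code[2] <= 0x8F)
        return code[0];

    /* 2E/3E in front of anything else is a segment override, not a hint. */
    return 0;
}
```

Fed the bytes 74 07 from the dump it returns 0, and it also returns 0 for the 66 2e 0f 1f ... padding, since the 2e there is neither the first byte nor followed by a Jcc opcode.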
Intel® 64 and IA-32 Architectures Software Developer’s Manual -> Volume 2: Instruction Set Reference, A-Z -> Chapter 2: Instruction Format -> 2.1 Instruction Format for Protected Mode, real-address Mode, and virtual-8086 mode -> 2.1.1 Instruction Prefixes
While Pentium 4 is the only generation which actually respects the branch-hint prefixes, most CPUs do have some form of static branch prediction, which can be used to achieve the same effect. This answer is a bit tangential to the original question, but I think this would be valuable information to anyone who comes to this page.
The Intel optimisation guide and Agner Fog's guide (which have been mentioned here already) both have excellent descriptions of this feature.
Intel has this to say about generations newer than Core 2:
So conditional branches which jump forward in the code are predicted to be not-taken, by the static prediction algorithm.
This is consistent with what GCC seems to have generated using __builtin_expect: the 'expected' return 1 / return 2 code is placed in the not-taken paths from the conditional branches, which will be statically predicted as not-taken.
Additionally:
So in the 'expected' not-taken paths where GCC has placed unconditional jmps to the end of the function, those jumps will be statically predicted as taken (i.e. not skipped).
Intel also says:
So conditional branches which jump backwards in the code are predicted to be taken, by the static prediction algorithm.
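To make the three rules concrete, here is a toy model of the static scheme as summarized in this answer: forward conditional branches not taken, backward conditional branches taken, unconditional jumps taken. The function name static_predict_taken is invented for this sketch. It only restates the rules above and says nothing about what the dynamic predictor does once a branch has history.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model of the static rules described above (name is hypothetical).
 * branch_addr: address of the jump instruction
 * target_addr: address it jumps to when taken
 * Returns true if the branch would be statically predicted as taken. */
static bool static_predict_taken(uintptr_t branch_addr,
                                 uintptr_t target_addr,
                                 bool is_conditional)
{
    if (!is_conditional)
        return true;                    /* unconditional jumps: taken */

    /* Backward conditional (typical loop edge): taken.
     * Forward conditional: not taken. */
    return target_addr <= branch_addr;
}
```

With the addresses from the question's dump, the je at 4004df targeting 4004e8 is a forward conditional branch, so the model predicts not taken, and the jmp at 4004e6 is predicted taken.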
According to Agner Fog, most Pentiums also follow this algorithm:
However, the Core 2 family (and Pentium M) has a completely different policy:
As do AMD processors apparently:
There is one additional factor to consider: CPUs generally like to execute in a linear fashion, so even correctly-predicted taken branches are often more expensive than correctly-predicted not-taken branches.
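In source code, the usual way to ask for that kind of layout is to mark the rare condition so the compiler can keep the common path as the fall-through. A minimal sketch, assuming GCC or Clang (__builtin_expect is a compiler extension); the LIKELY/UNLIKELY macro names and the sum_nonnegative function are made up for illustration. The compiler is free to ignore the hint, so check the generated assembly if the layout matters.

```c
#include <stddef.h>

/* Assumed macro names. The !!(x) turns any non-zero value into 1 so that
 * it can be compared against the expected constant. */
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

/* Hypothetical example: sum a buffer, treating a negative entry as a rare
 * error. On error, *error is set to 1 and -1 is returned. */
static long sum_nonnegative(const int *values, size_t count, int *error)
{
    long total = 0;
    *error = 0;

    for (size_t i = 0; i < count; ++i) {
        if (UNLIKELY(values[i] < 0)) {  /* rare path, hinted as unexpected */
            *error = 1;
            return -1;
        }
        total += values[i];             /* common path */
    }
    return total;
}
```

Note the difference from the first test in the question: __builtin_expect(argc, 0) says the value of argc itself is expected to be 0, which is why normalizing with !! is the common idiom when the expected value is 1.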
gcc is right not to generate the prefixes, as they have no effect on any processor newer than the Pentium 4.
But __builtin_expect has other effects, such as moving an unexpected code path away from the cache-hot parts of the code and influencing inlining decisions, so it is still useful.
These instruction prefixes have no effect on modern processors (anything newer than Pentium 4). They just cost one byte of code space, so not generating them is the right thing to do.
For details, see Agner Fog's optimization manuals, in particular 3. Microarchitecture: http://www.agner.org/optimize/
The "Intel® 64 and IA-32 Architectures Optimization Reference Manual" no longer mentions them in the section about optimizing branches (section 3.4.1): http://www.intel.de/content/dam/doc/manual/64-ia-32-architectures-optimization-manual.pdf
These prefixes are a (harmless) relic of the Netburst architecture. In all-out optimization, you can use them to align code, but that's all they're good for nowadays.