Years ago I was learning about x86 assembler, CPU pipelining, cache misses, branch prediction, and all that jazz.
It was a tale of two halves. I read about all the wonderful advantages of the lengthy pipelines in the processor viz instruction reordering, cache preloading, dependency interleaving, etc.
The downside was that any deviation for the norm was enormously costly. For example, IIRC a certain AMD processor in the early-gigahertz era had a 40 cycle penalty every time you called a function through a pointer (!) and this was apparently normal.
This is not a negligible "don't worry about it" number! Bear in mind that "good design" normally means "factor your functions as much as possible" and "encode semantics in the data types" which often implies virtual interfaces.
The trade-off is that code which doesn't perform such operations might get more than two instructions per cycle. These are numbers one wants to worry about when writing high-performance C++ code which is heavy on the object design and light on the number crunching.
I understand that the long-CPU-pipeline trend has been reversing as we enter the low-power era. Here's my question:
Does the latest generation of x86-compatible processors still suffer massive penalties for virtual function calls, bad branch predictions, etc?
AMD processor in the early-gigahertz era had a 40 cycle penalty every time you called a function
Huh.. so large..
There is an "Indirect branch prediction" method, which helps to predict virtual function jump, IF there was the same indirect jump some time ago. There is still a penalty for first and mispredicted virt. function jump.
Support varies from simple "predicted right if and only if the previous indirect branch was exactly the same" to very complex two-level tens or hundreds entries with detecting of periodic alternation of 2-3 target address for single indirect jmp instruction.
There was a lot of evolution here...
http://arstechnica.com/hardware/news/2006/04/core.ars/7
first introduced with the Pentium M: ... indirect branch predictor.
The indirect branch predictor
Because indirect branches load their branch targets from a register, instead of having them immediately available as is the case with direct branches, they're notoriously difficult to predict. Core's indirect branch predictor is a table that stores history information about the preferred target addresses of each indirect branch that the front end encounters. Thus when the front-end encounters an indirect branch and predicts it as taken, it can ask the indirect branch predictor to direct it to the address in the BTB that the branch will probably want.
http://www.realworldtech.com/page.cfm?ArticleID=rwt051607033728&p=3
Indirect branch prediction was first introduced with Intel’s Prescott microarchitecture and later the Pentium M.
between 16-50% of all branch mispredicts were indirect (29% on average). The real value of indirect branch misprediction is for many of the newer scripting or high level languages, such as Ruby, Perl or Python, which use interpreters. Other common indirect branch common culprits include virtual functions (used in C++) and calls to function pointers.
http://www.realworldtech.com/page.cfm?ArticleID=RWT102808015436&p=5
AMD has adopted some of these refinements; for instance adding indirect branch predictor arrays in Barcelona and later processors. However, the K8 has older and less accurate branch predictors than the Core 2.
http://www.agner.org/optimize/microarchitecture.pdf
3.12 Indirect jumps on older processors
Indirect jumps, indirect calls, and returns may go to a different address each time. The
prediction method for an indirect jump or indirect call is, in processors older than PM and
K10, simply to predict that it will go to the same target as last time it was executed.
and the same pdf, page 14
Indirect jump prediction
An indirect jump or call is a control transfer instruction that has more than two possible
targets. A C++ program can generate an indirect jump or call with... a virtual function. An indirect jump or call is generated in assembly by
specifying a register or a memory variable or an indexed array as the destination of a jump
or call instruction. Many processors make only one BTB entry for an indirect jump or call.
This means that it will always be predicted to go to the same target as it did last time.
As object oriented programming with polymorphous classes has become more common,
there is a growing need for predicting indirect calls with multiple targets. This can be done
by assigning a new BTB entry for every new jump target that is encountered. The history
buffer and pattern history table must have space for more than one bit of information for
each jump incident in order to distinguish more than two possible targets.
The PM is the first x86 processor to implement this method. The prediction rule on p. 12 still
applies with the modification that the theoretical maximum period that can be predicted
perfectly is mn, where m is the number of different targets per indirect jump, because there
are mn different possible n-length subsequences. However, this theoretical maximum cannot
be reached if it exceeds the size of the BTB or the pattern history table.
Agner's manual has a longer description of branch predictor in many modern CPUs and the evolution of predictor in cpus of every manufacturer (x86/x86_64).
Also a lot of theoretical "indirect branch prediction" methods (look in the Google scholar); even wiki said some words about it http://en.wikipedia.org/wiki/Branch_predictor#Prediction_of_indirect_jumps /
For Atoms from the agner's micro:
Prediction of indirect branches
The Atom has no pattern predictor for indirect branches according to my tests. Indirect
branches are predicted to go to the same target as last time.
So, for low power, indirect branch prediction is not so advanced. So does Via Nano:
Indirect jumps are predicted to go to the same target as last time.
I think, that shorter pipeline of lowpower x86 has lower penalty, 7-20 ticks.