I was discussing the merits of "modern" languages compared to C++ with some friends recently, when the following came up (I think inspired by Java):
Does any C++ compiler optimize dynamic dispatch out of a loop? If not, is there any kind of construction that would allow the author to force (or strongly encourage) such an optimization?
Here's the example. Suppose I have a polymorphic hierarchy:
struct A { virtual int f() { return 0; } };
struct B : A { virtual int f() { return /* something complicated */; } /*...*/ };
Now I have a loop that accumulates f():
int acc(const A * p, unsigned int N)
{
int result = 0;
for (unsigned int i = 0; i != N; ++i)
result += p->f(); // #1
return result;
}
In this function, the dynamic dispatch p->f() appears to happen on every iteration of the loop. However, the ultimate target of the dispatch plainly cannot change between iterations.
Question: Does this dynamic dispatch get optimized by the compiler? If not, is there any way to rewrite the code to force this optimization, or at least enable the compiler to recognize this? Is there any good test code that can tell me quickly whether this is getting optimized already?
I'm interested in both language and implementation answers, such as "this is impossible according to the standard", or "yes, MSVC does this with option /xyzzy".
Some comparative remarks: Apparently Java does optimize and even inline the call in the inner loop if appropriate. Objective-C++ apparently allows you to query the dynamic function pointer and store it.
Clarification: The main use case which I'm interested in is when the base class and the function with the loop (like the accumulator) are part of a separate translation unit or library, and there is no control over or knowledge of the derived classes.
I compiled the above code. The only change I made was to mark the methods const, since the parameter p to acc() is also const. When I compiled it (on a MacBook) with g++ 4.2.1 and -O3, I got the following code, which looks like the loop in acc().
It does not look like it is chaining through the lookup table; it is a simple fetch via a register that already holds the vtable pointer.
If I remove the virtual specifiers, the same code becomes:
So the difference in the above code is really:
So, to answer the question: no, the address of the function is not cached for reuse. It is looked up each time through the loop.
Source that was compiled:
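The compiled source itself is not reproduced here; the following is a reconstruction based on the question's code with the const change described above (the return value of B::f() is illustrative, since the original used a placeholder):

```cpp
struct A { virtual int f() const { return 0; } virtual ~A() {} };
struct B : A { int f() const override { return 7; } };  // illustrative body

int acc(const A *p, unsigned int N)
{
    int result = 0;
    for (unsigned int i = 0; i != N; ++i)
        result += p->f(); // #1: dispatched through the vtable each iteration
    return result;
}
```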
Full Assembly:
Here's the required template version:
And usage is:
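The template code itself is missing above; a minimal sketch of what such a template version could look like (the method bodies here are illustrative). The key point is that the caller supplies the concrete static type T, so the qualified call p->T::f() is bound at compile time and performs no vtable lookup:

```cpp
struct A { virtual int f() const { return 0; } virtual ~A() {} };
struct B : A { int f() const override { return 7; } };  // illustrative body

// T is the concrete static type; the qualified call below is resolved
// statically, so there is no virtual dispatch inside the loop.
template <typename T>
int acc(const T *p, unsigned int N)
{
    int result = 0;
    for (unsigned int i = 0; i != N; ++i)
        result += p->T::f();  // qualified name: no vtable lookup, inlinable
    return result;
}
```

Usage would be along the lines of B b; int r = acc(&b, N);. Note the trade-off: the caller must know the dynamic type. Calling acc<A>(&b, N) would statically bind to A::f() and silently skip the override, which is exactly why this approach conflicts with the separate-translation-unit use case in the question's clarification.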
If you're interested in this kind of thing, check out Agner Fog's excellent Software Optimization Manuals. This question is tangentially addressed in the first of the five, Optimizing C++ (pdf) (the others are all about assembly - he's kind of old-school).
If f() is a const function, or its return value when called on p is otherwise guaranteed to be unchanged, it can be pulled out of the loop and only calculated once (see "Loop Invariant Code Motion", page 70). Most compilers will do this (see "Comparison of Different Compilers", page 74).
If that can't be done, it might still be possible to devirtualize. This can't be done inside a separately compiled function, because that function would have to use a virtual lookup for the sake of correctness. But if the function is inlined, and the type of p is known in the calling scope, it can be done. The calling code would have to look something like this. But according to that table (page 74), only the GCC compilers make this optimization.
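The calling code referred to above is not shown; here is a hedged sketch of the shape it would need, with the dynamic type of p visible at the call site (hierarchy and return values are illustrative):

```cpp
struct A { virtual int f() const { return 0; } virtual ~A() {} };
struct B : A { int f() const override { return 7; } };  // illustrative body

inline int acc(const A *p, unsigned int N)
{
    int result = 0;
    for (unsigned int i = 0; i != N; ++i)
        result += p->f();
    return result;
}

int caller()
{
    B b;  // dynamic type is provably B in this scope
    // If acc() is inlined here, the compiler can prove *p is a B and
    // devirtualize (and potentially hoist or inline) the call to f().
    return acc(&b, 10);
}
```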
Finally, the closest optimization (I think) to what you're asking. Could the compiler perform the virtual lookup once, store a function pointer, and then use that function pointer to avoid the virtual lookup inside the loop? I don't see why not. But I don't know if any compilers do so - it's an obscure enough optimization that it's not even mentioned in Agner Fog's compulsively detailed C++ manual.
For what it's worth, here's what he has to say about function pointers (page 38):
And an excerpt about virtual member functions (page 54):
You know the function pointer wouldn't change in your example, so you wouldn't get the misprediction penalty, but he never compares function pointer performance to virtual function performance directly. Both just take "a few" more clock cycles than a regular function call. Maybe it's the same mechanism - if so, that "optimization" would just be adding an extra lookup.
So it's hard to say, really. The best way to get an answer might just be to have your favourite compiler spit out some optimized assembly and dig through it (unpleasant, but conclusive!).
Hope this helps!
It has been pointed out to me that GCC has an extension, called "bound member functions", that does indeed allow you to store the actual function pointer. Demo:
The syntax requires that you go through a pointer-to-member indirection (i.e. you cannot just write (x.f)), and the cast must be a C-style cast. The resulting function pointer has the type of a pointer to a free function, with the instance argument taken as the first parameter.
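A short demo of the extension described above (GCC-only; compile with -Wno-pmf-conversions to silence the conversion warning; the hierarchy and return values are illustrative):

```cpp
struct A { virtual int f() { return 0; } virtual ~A() {} };
struct B : A { int f() override { return 7; } };  // illustrative body

typedef int (*fn)(A *);  // the instance becomes the first parameter

int acc(A *p, unsigned int N)
{
    // One virtual lookup, hoisted out of the loop: the C-style cast from
    // the bound pointer-to-member expression to an ordinary function
    // pointer is the GCC "bound member functions" extension.
    fn fp = (fn)(p->*(&A::f));
    int result = 0;
    for (unsigned int i = 0; i != N; ++i)
        result += fp(p);
    return result;
}
```

Note that fp is resolved through the vtable of *p exactly once, before the loop; every call in the loop is then an ordinary indirect call.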