I was discussing the merits of "modern" languages compared to C++ with some friends recently, when the following came up (I think inspired by Java):
Does any C++ compiler optimize dynamic dispatch out of a loop? If not, is there any kind of construction that would allow the author to force (or strongly encourage) such an optimization?
Here's the example. Suppose I have a polymorphic hierarchy:
struct A { virtual int f() { return 0; } };
struct B : A { virtual int f() { return /* something complicated */; } /*...*/ };
Now I have a loop that accumulates f():
int acc(const A * p, unsigned int N)
{
int result = 0;
for (unsigned int i = 0; i != N; ++i)
result += p->f(); // #1
return result;
}
In this function, the dynamic dispatch p->f() appears to happen during every round of the loop. However, the ultimate target of the dispatch blatantly (?) cannot vary.
Question: Does this dynamic dispatch get optimized by the compiler? If not, is there any way to rewrite the code to force this optimization, or at least enable the compiler to recognize this? Is there any good test code that can tell me quickly whether this is getting optimized already?
I'm interested in both language and implementation answers, such as "this is impossible according to the standard", or "yes, MSVC does this with option /xyzzy".
Some comparative remarks: Apparently Java does optimize and even inline the call in the inner loop if appropriate. Objective-C++ apparently allows you to query the dynamic function pointer and store it.
Clarification: The main use case which I'm interested in is when the base class and the function with the loop (like the accumulator) are part of a separate translation unit or library, and there is no control over or knowledge of the derived classes.
I compiled the above code. The only change I made was to make the methods const, because the parameter p of acc() is a pointer to const.
When I compiled it (on a MacBook) using g++ 4.2.1 and -O3, I got the following code for the loop in acc().
It does not search through any lookup structure, but it does not cache the function address either.
The object pointer is already in a register; each iteration loads the vtable pointer through it and calls through the first vtable slot.
L9:
movq (%r12), %rax // Load the vtable pointer from the object (p is held in r12)
movq %r12, %rdi // Set up rdi as `this` for the call
call *(%rax) // Call f(): its address is read from the first vtable slot, which rax points at
addl %eax, %r14d
incl %ebx
cmpl %r13d, %ebx
jne L9
If I remove the virtual specifiers from the declarations, the same loop becomes:
L16:
movq %r14, %rdi // Set up rdi as `this` for the call
call __ZNK1A1fEv // Call A::f() directly
addl %eax, %r13d
incl %ebx
cmpl %r12d, %ebx
jne L16
So the difference between the two loops is really:
movq (%r12), %rax This loads the object's vtable pointer from memory
                  (the object's address is in r12). After the first
                  iteration the data is in the cache, so the cost is
                  practically nothing and you are unlikely to detect
                  it, no matter how many times you call the function.
call *(%rax)      Here we have to look up the address to call by getting it
                  from the vtable in memory. Now this could be expensive,
                  but in reality it is not. The first time this is called
                  the memory is placed in an on-chip cache (if it is not
                  there already you get a processor stall while it is
                  loaded from memory or the next cache level up), but
                  after that it is really fast.
                  It is not quite as fast as calling a fixed address (as
                  in the non-virtual version), but the difference is
                  insignificant: other factors in the code will drown
                  out any gain, or it will be lost in the noise of the
                  measurements.
So, to answer the question: no, the address of the function is not cached for reuse. It is looked up on every iteration of the loop.
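To check the claim that the extra lookup is lost in the noise, the loop can be timed directly. The following is only a minimal sketch: Timed and timed_acc are hypothetical names that are not part of the original code, and it assumes that acc() is built in its own translation unit for a real measurement, since otherwise the optimizer may inline or devirtualize the call and the numbers mean little.

```cpp
#include <cassert>
#include <chrono>

struct A { virtual ~A() = default; virtual int f() const { return 0; } };
struct B : A { int f() const override { return 1; } };

// Same accumulator as in the question. ASSUMPTION: for a meaningful
// measurement this lives in a separate translation unit so that the
// compiler cannot see the dynamic type of *p at the call site.
int acc(const A* p, unsigned int N)
{
    int result = 0;
    for (unsigned int i = 0; i != N; ++i)
        result += p->f();
    return result;
}

// Hypothetical helper type: the accumulated value plus the elapsed time.
struct Timed { int result; long long nanoseconds; };

// Runs acc() once and measures it with a monotonic clock.
Timed timed_acc(const A* p, unsigned int N)
{
    const auto start = std::chrono::steady_clock::now();
    const int result = acc(p, N);
    const auto stop = std::chrono::steady_clock::now();
    const auto ns =
        std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start);
    return Timed{result, static_cast<long long>(ns.count())};
}

// Usage sketch: only the accumulated value is predictable, the time is not.
void demo_timing()
{
    B b;
    const Timed t = timed_acc(&b, 1000000u);
    assert(t.result == 1000000);
    assert(t.nanoseconds >= 0);
}
```

Running the same harness against a non-virtual variant of f() and comparing the two times over several runs gives a rough idea of how much the dispatch costs on a given machine.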
Source that was compiled:
#include <iostream>
struct A { virtual int f() const { return 0; } };
struct B : A { virtual int f() const { return 1; }};
int acc(const A * p, unsigned int N)
{
int result = 0;
for (unsigned int i = 0; i != N; ++i)
result += p->f(); // #1
return result;
}
int main()
{
A a;
B b;
std::cout << acc(&a, 20) << "\n";
std::cout << acc(&b, 22) << "\n";
}
Full assembly:
1 .mod_init_func
2 .align 3
3 .quad __GLOBAL__I__Z3accPK1Aj
4 .section __TEXT,__textcoal_nt,coalesced,pure_instructions
5 .align 1
6 .align 4
7 .globl __ZNK1A1fEv
8 .weak_definition __ZNK1A1fEv
9 __ZNK1A1fEv:
10 LFB1477:
11 pushq %rbp
12 LCFI0:
13 movq %rsp, %rbp
14 LCFI1:
15 xorl %eax, %eax
16 leave
17 ret
18 LFE1477:
19 .align 1
20 .align 4
21 .globl __ZNK1B1fEv
22 .weak_definition __ZNK1B1fEv
23 __ZNK1B1fEv:
24 LFB1478:
25 pushq %rbp
26 LCFI2:
27 movq %rsp, %rbp
28 LCFI3:
29 movl $1, %eax
30 leave
31 ret
32 LFE1478:
33 .text
34 .align 4,0x90
35 .globl __Z3accPK1Aj
36 __Z3accPK1Aj:
37 LFB1479:
38 pushq %rbp
39 LCFI4:
40 movq %rsp, %rbp
41 LCFI5:
42 pushq %r14
43 LCFI6:
44 pushq %r13
45 LCFI7:
46 pushq %r12
47 LCFI8:
48 pushq %rbx
49 LCFI9:
50 movq %rdi, %r12
51 movl %esi, %r13d
52 xorl %r14d, %r14d
53 testl %esi, %esi
54 je L8
55 xorl %ebx, %ebx
56 .align 4,0x90
57 L9:
58 movq (%r12), %rax
59 movq %r12, %rdi
60 call *(%rax)
61 addl %eax, %r14d
62 incl %ebx
63 cmpl %r13d, %ebx
64 jne L9
65 L8:
66 movl %r14d, %eax
67 popq %rbx
68 popq %r12
69 popq %r13
70 popq %r14
71 leave
72 ret
73 LFE1479:
74 .section __TEXT,__StaticInit,regular,pure_instructions
75 .align 4
76 __Z41__static_initialization_and_destruction_0ii:
77 LFB1649:
78 pushq %rbp
79 LCFI10:
80 movq %rsp, %rbp
81 LCFI11:
82 decl %edi
83 je L18
84 L17:
85 leave
86 ret
87 .align 4
88 L18:
89 cmpl $65535, %esi
90 jne L17
91 leaq __ZStL8__ioinit(%rip), %rdi
92 call __ZNSt8ios_base4InitC1Ev
93 movq ___dso_handle@GOTPCREL(%rip), %rdx
94 xorl %esi, %esi
95 leaq ___tcf_0(%rip), %rdi
96 leave
97 jmp ___cxa_atexit
98 LFE1649:
99 .align 4
100 __GLOBAL__I__Z3accPK1Aj:
101 LFB1651:
102 pushq %rbp
103 LCFI12:
104 movq %rsp, %rbp
105 LCFI13:
106 movl $65535, %esi
107 movl $1, %edi
108 leave
109 jmp __Z41__static_initialization_and_destruction_0ii
110 LFE1651:
111 .text
112 .align 4,0x90
113 ___tcf_0:
114 LFB1650:
115 pushq %rbp
116 LCFI14:
117 movq %rsp, %rbp
118 LCFI15:
119 leaq __ZStL8__ioinit(%rip), %rdi
120 leave
121 jmp __ZNSt8ios_base4InitD1Ev
122 LFE1650:
123 .cstring
124 LC0:
125 .ascii "\12\0"
126 .text
127 .align 4,0x90
128 .globl _main
129 _main:
130 LFB1480:
131 pushq %rbp
132 LCFI16:
133 movq %rsp, %rbp
134 LCFI17:
135 pushq %r14
136 LCFI18:
137 pushq %r13
138 LCFI19:
139 pushq %r12
140 LCFI20:
141 pushq %rbx
142 LCFI21:
143 subq $32, %rsp
144 LCFI22:
145 movq __ZTV1A@GOTPCREL(%rip), %rax
146 addq $16, %rax
147 movq %rax, -48(%rbp)
148 movq __ZTV1B@GOTPCREL(%rip), %rax
149 addq $16, %rax
150 movq %rax, -64(%rbp)
151 leaq -48(%rbp), %r13
152 movq %r13, %rdi
153 call __ZNK1A1fEv
154 movl %eax, %ebx
155 movl $1, %r12d
156 .align 4,0x90
157 L24:
158 movq %r13, %rdi
159 call __ZNK1A1fEv
160 addl %eax, %ebx
161 incl %r12d
162 cmpl $20, %r12d
163 jne L24
164 movl %ebx, %esi
165 movq __ZSt4cout@GOTPCREL(%rip), %r14
166 movq %r14, %rdi
167 call __ZNSolsEi
168 movq %rax, %rdi
169 movl $1, %edx
170 leaq LC0(%rip), %rsi
171 call __ZSt16__ostream_insertIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_PKS3_l
172 leaq -64(%rbp), %r13
173 movq %r13, %rdi
174 movq -64(%rbp), %rax
175 call *(%rax)
176 movl %eax, %ebx
177 movb $1, %r12b
178 .align 4,0x90
179 L26:
180 movq %r13, %rdi
181 movq -64(%rbp), %rax
182 call *(%rax)
183 addl %eax, %ebx
184 incl %r12d
185 cmpl $22, %r12d
186 jne L26
187 movl %ebx, %esi
188 movq %r14, %rdi
189 call __ZNSolsEi
190 movq %rax, %rdi
191 movl $1, %edx
192 leaq LC0(%rip), %rsi
193 call __ZSt16__ostream_insertIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_PKS3_l
194 xorl %eax, %eax
195 addq $32, %rsp
196 popq %rbx
197 popq %r12
198 popq %r13
199 popq %r14
200 leave
201 ret
202 LFE1480:
203 .lcomm __ZStL8__ioinit,1,0
204 .globl __ZTV1A
205 .weak_definition __ZTV1A
206 .section __DATA,__const_coal,coalesced
207 .align 4
208 __ZTV1A:
209 .quad 0
210 .quad __ZTI1A
211 .quad __ZNK1A1fEv
212 .globl __ZTI1A
213 .weak_definition __ZTI1A
214 .align 4
215 __ZTI1A:
216 .quad __ZTVN10__cxxabiv117__class_type_infoE+16
217 .quad __ZTS1A
218 .globl __ZTS1A
219 .weak_definition __ZTS1A
220 .section __TEXT,__const_coal,coalesced
221 __ZTS1A:
222 .ascii "1A\0"
223 .globl __ZTV1B
224 .weak_definition __ZTV1B
225 .section __DATA,__const_coal,coalesced
226 .align 4
227 __ZTV1B:
228 .quad 0
229 .quad __ZTI1B
230 .quad __ZNK1B1fEv
231 .globl __ZTI1B
232 .weak_definition __ZTI1B
233 .align 4
234 __ZTI1B:
235 .quad __ZTVN10__cxxabiv120__si_class_type_infoE+16
236 .quad __ZTS1B
237 .quad __ZTI1A
238 .globl __ZTS1B
239 .weak_definition __ZTS1B
240 .section __TEXT,__const_coal,coalesced
241 __ZTS1B:
242 .ascii "1B\0"
243 .section __TEXT,__eh_frame,coalesced,no_toc+strip_static_syms+live_support
244 EH_frame1:
245 .set L$set$0,LECIE1-LSCIE1
246 .long L$set$0
247 LSCIE1:
248 .long 0x0
249 .byte 0x1
250 .ascii "zPR\0"
251 .byte 0x1
252 .byte 0x78
253 .byte 0x10
254 .byte 0x6
255 .byte 0x9b
256 .long ___gxx_personality_v0+4@GOTPCREL
257 .byte 0x10
258 .byte 0xc
259 .byte 0x7
260 .byte 0x8
261 .byte 0x90
262 .byte 0x1
263 .align 3
264 LECIE1:
265 .globl __ZNK1A1fEv.eh
266 .weak_definition __ZNK1A1fEv.eh
267 __ZNK1A1fEv.eh:
268 LSFDE1:
269 .set L$set$1,LEFDE1-LASFDE1
270 .long L$set$1
271 LASFDE1:
272 .long LASFDE1-EH_frame1
273 .quad LFB1477-.
274 .set L$set$2,LFE1477-LFB1477
275 .quad L$set$2
276 .byte 0x0
277 .byte 0x4
278 .set L$set$3,LCFI0-LFB1477
279 .long L$set$3
280 .byte 0xe
281 .byte 0x10
282 .byte 0x86
283 .byte 0x2
284 .byte 0x4
285 .set L$set$4,LCFI1-LCFI0
286 .long L$set$4
287 .byte 0xd
288 .byte 0x6
289 .align 3
290 LEFDE1:
291 .globl __ZNK1B1fEv.eh
292 .weak_definition __ZNK1B1fEv.eh
293 __ZNK1B1fEv.eh:
294 LSFDE3:
295 .set L$set$5,LEFDE3-LASFDE3
296 .long L$set$5
297 LASFDE3:
298 .long LASFDE3-EH_frame1
299 .quad LFB1478-.
300 .set L$set$6,LFE1478-LFB1478
301 .quad L$set$6
302 .byte 0x0
303 .byte 0x4
304 .set L$set$7,LCFI2-LFB1478
305 .long L$set$7
306 .byte 0xe
307 .byte 0x10
308 .byte 0x86
309 .byte 0x2
310 .byte 0x4
311 .set L$set$8,LCFI3-LCFI2
312 .long L$set$8
313 .byte 0xd
314 .byte 0x6
315 .align 3
316 LEFDE3:
317 .globl __Z3accPK1Aj.eh
318 __Z3accPK1Aj.eh:
319 LSFDE5:
320 .set L$set$9,LEFDE5-LASFDE5
321 .long L$set$9
322 LASFDE5:
323 .long LASFDE5-EH_frame1
324 .quad LFB1479-.
325 .set L$set$10,LFE1479-LFB1479
326 .quad L$set$10
327 .byte 0x0
328 .byte 0x4
329 .set L$set$11,LCFI4-LFB1479
330 .long L$set$11
331 .byte 0xe
332 .byte 0x10
333 .byte 0x86
334 .byte 0x2
335 .byte 0x4
336 .set L$set$12,LCFI5-LCFI4
337 .long L$set$12
338 .byte 0xd
339 .byte 0x6
340 .byte 0x4
341 .set L$set$13,LCFI9-LCFI5
342 .long L$set$13
343 .byte 0x83
344 .byte 0x6
345 .byte 0x8c
346 .byte 0x5
347 .byte 0x8d
348 .byte 0x4
349 .byte 0x8e
350 .byte 0x3
351 .align 3
352 LEFDE5:
353 __Z41__static_initialization_and_destruction_0ii.eh:
354 LSFDE7:
355 .set L$set$14,LEFDE7-LASFDE7
356 .long L$set$14
357 LASFDE7:
358 .long LASFDE7-EH_frame1
359 .quad LFB1649-.
360 .set L$set$15,LFE1649-LFB1649
361 .quad L$set$15
362 .byte 0x0
363 .byte 0x4
364 .set L$set$16,LCFI10-LFB1649
365 .long L$set$16
366 .byte 0xe
367 .byte 0x10
368 .byte 0x86
369 .byte 0x2
370 .byte 0x4
371 .set L$set$17,LCFI11-LCFI10
372 .long L$set$17
373 .byte 0xd
374 .byte 0x6
375 .align 3
376 LEFDE7:
377 __GLOBAL__I__Z3accPK1Aj.eh:
378 LSFDE9:
379 .set L$set$18,LEFDE9-LASFDE9
380 .long L$set$18
381 LASFDE9:
382 .long LASFDE9-EH_frame1
383 .quad LFB1651-.
384 .set L$set$19,LFE1651-LFB1651
385 .quad L$set$19
386 .byte 0x0
387 .byte 0x4
388 .set L$set$20,LCFI12-LFB1651
389 .long L$set$20
390 .byte 0xe
391 .byte 0x10
392 .byte 0x86
393 .byte 0x2
394 .byte 0x4
395 .set L$set$21,LCFI13-LCFI12
396 .long L$set$21
397 .byte 0xd
398 .byte 0x6
399 .align 3
400 LEFDE9:
401 ___tcf_0.eh:
402 LSFDE11:
403 .set L$set$22,LEFDE11-LASFDE11
404 .long L$set$22
405 LASFDE11:
406 .long LASFDE11-EH_frame1
407 .quad LFB1650-.
408 .set L$set$23,LFE1650-LFB1650
409 .quad L$set$23
410 .byte 0x0
411 .byte 0x4
412 .set L$set$24,LCFI14-LFB1650
413 .long L$set$24
414 .byte 0xe
415 .byte 0x10
416 .byte 0x86
417 .byte 0x2
418 .byte 0x4
419 .set L$set$25,LCFI15-LCFI14
420 .long L$set$25
421 .byte 0xd
422 .byte 0x6
423 .align 3
424 LEFDE11:
425 .globl _main.eh
426 _main.eh:
427 LSFDE13:
428 .set L$set$26,LEFDE13-LASFDE13
429 .long L$set$26
430 LASFDE13:
431 .long LASFDE13-EH_frame1
432 .quad LFB1480-.
433 .set L$set$27,LFE1480-LFB1480
434 .quad L$set$27
435 .byte 0x0
436 .byte 0x4
437 .set L$set$28,LCFI16-LFB1480
438 .long L$set$28
439 .byte 0xe
440 .byte 0x10
441 .byte 0x86
442 .byte 0x2
443 .byte 0x4
444 .set L$set$29,LCFI17-LCFI16
445 .long L$set$29
446 .byte 0xd
447 .byte 0x6
448 .byte 0x4
449 .set L$set$30,LCFI22-LCFI17
450 .long L$set$30
451 .byte 0x83
452 .byte 0x6
453 .byte 0x8c
454 .byte 0x5
455 .byte 0x8d
456 .byte 0x4
457 .byte 0x8e
458 .byte 0x3
459 .align 3
460 LEFDE13:
461 .constructor
462 .destructor
463 .align 1
464 .subsections_via_symbols
If you're interested in this kind of thing, check out Agner Fog's excellent Software Optimization Manuals. This question is tangentially addressed in the first of the five, Optimizing C++ (pdf) (the others are all about assembly - he's kind of old-school).
If f() is a pure function (a const qualifier alone does not guarantee that), or its return value when called on p is otherwise guaranteed to be unchanged, the call can be pulled out of the loop and evaluated only once (see "Loop Invariant Code Motion", page 70). Most compilers will do this (see "Comparison of Different Compilers", page 74).
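When the compiler cannot prove that the call is invariant, the same hoisting can be done by hand. This is a minimal sketch, assuming f() has no side effects and returns the same value on every call for a given object; acc_hoisted is a hypothetical name, not something from the original code.

```cpp
#include <cassert>

struct A { virtual ~A() = default; virtual int f() const { return 0; } };
struct B : A { int f() const override { return 1; } };

// Hand-hoisted accumulator: one virtual call instead of N.
// Only valid under the ASSUMPTION that f() is pure (no side effects,
// same result every time for the same object).
int acc_hoisted(const A* p, unsigned int N)
{
    if (N == 0) return 0;        // keep "f() is never called" when N == 0
    const int value = p->f();    // the single dynamic dispatch
    int result = 0;
    for (unsigned int i = 0; i != N; ++i)
        result += value;         // no call inside the loop
    return result;
}

// Usage sketch mirroring the calls in main() from the earlier answer.
void demo_hoisted()
{
    A a;
    B b;
    assert(acc_hoisted(&a, 20) == 0);
    assert(acc_hoisted(&b, 22) == 22);
}
```

If f() does have side effects or can return different values, this rewrite changes the program's behaviour, which is exactly why a compiler will not do it without proof.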
If that can't be done, it might still be possible to devirtualize the call. This can't be done inside a standalone, callable copy of acc(), because that copy has to use a virtual lookup to be correct for every possible caller. But if the function were inlined, and the type of p were known in the calling scope, it could be done. The calling code would have to look something like this:
A* aptr = new A; // <- The compiler knows exactly what type aptr points to
acc(aptr, 100); // <- This would have to be inlined!
But according to that table (page 74), only the GCC compilers make this optimization.
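Another way to hand the compiler the type information it needs, assuming C++11 or later, is the final specifier. The sketch below is illustrative only: acc_final is a hypothetical overload, and whether a particular compiler really emits a direct call still has to be confirmed in the generated assembly.

```cpp
#include <cassert>

struct A { virtual ~A() = default; virtual int f() const { return 0; } };

// 'final' promises that nothing derives from B, so a call made through a
// B pointer or reference can only ever reach B::f.
struct B final : A { int f() const override { return 1; } };

// Hypothetical overload taking the concrete type. The compiler is allowed
// (not required) to turn the virtual call into a direct, inlinable call.
int acc_final(const B* p, unsigned int N)
{
    int result = 0;
    for (unsigned int i = 0; i != N; ++i)
        result += p->f();
    return result;
}

// Usage sketch.
void demo_final()
{
    B b;
    assert(acc_final(&b, 22) == 22);
}
```

This only helps when the caller knows the concrete type, so it does not cover the case where the loop lives in a library that sees nothing but A.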
Finally, the closest optimization (I think) to what you're asking. Could the compiler perform the virtual lookup once, store a function pointer, and then use that function pointer to avoid the virtual lookup inside the loop? I don't see why not. But I don't know if any compilers do so - it's an obscure enough optimization that it's not even mentioned in Agner Fog's compulsively detailed C++ manual.
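Standard C++ offers no way to ask an object for the resolved address of a virtual function, but the idea can be emulated by hand. The following is a sketch under stated assumptions: resolve_f, Fn and acc_cached are hypothetical names, and every class that overrides f() must also override resolve_f(), otherwise the cached pointer reaches the wrong function.

```cpp
#include <cassert>

struct A {
    using Fn = int (*)(const A*);   // plain function pointer taking the object

    virtual ~A() = default;
    virtual int f() const { return 0; }

    // Hypothetical helper: a single virtual call that hands back a thunk.
    // The qualified call A::f() inside the thunk is not dispatched.
    virtual Fn resolve_f() const {
        return [](const A* self) { return self->A::f(); };
    }
};

struct B : A {
    int f() const override { return 1; }

    // Must be overridden together with f().
    Fn resolve_f() const override {
        return [](const A* self) {
            return static_cast<const B*>(self)->B::f();
        };
    }
};

int acc_cached(const A* p, unsigned int N)
{
    const A::Fn fn = p->resolve_f();   // dynamic dispatch happens once, here
    int result = 0;
    for (unsigned int i = 0; i != N; ++i)
        result += fn(p);               // indirect call, but no vtable load
    return result;
}

// Usage sketch.
void demo_cached()
{
    A a;
    B b;
    assert(acc_cached(&a, 20) == 0);
    assert(acc_cached(&b, 22) == 22);
}
```

The loop still makes an indirect call on every iteration, so any gain over the ordinary virtual call is limited to the saved vtable load and would need to be measured.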
For what it's worth, here's what he has to say about function pointers (page 38):
Calling a function through a function pointer typically takes a few
clock cycles more than calling the function directly if the target
address can be predicted. The target address is predicted if the value
of the function pointer is the same as last time the statement was
executed. If the value of the function pointer has changed then the
target address is likely to be mispredicted, which causes a long
delay. See page 44 about branch prediction. A Pentium M processor may
be able to predict the target if the changes of the function pointer
follows a simple regular pattern, while Pentium 4 and AMD processors
are sure to make a misprediction every time the function pointer has
changed.
And an excerpt about virtual member functions (page 54):
The time it takes to call a virtual member function is a few clock
cycles more than it takes to call a non-virtual member function,
provided that the function call statement always calls the same
version of the virtual function. If the version changes then you may
get a misprediction penalty of 10 - 20 clock cycles. The rules for
prediction and misprediction of virtual function calls is the same as
for switch statements, as explained on page 45.
The dispatching mechanism can be bypassed when the virtual function is
called on an object of known type, but you cannot always rely on the
compiler bypassing the dispatch mechanism even when it would be
obvious to do so. See page 73.
You know the function pointer wouldn't change in your example, so you wouldn't get the misprediction penalty, but he never compares function pointer performance to virtual function performance directly. Both just take "a few" more clock cycles than a regular function call. Maybe it's the same mechanism - if so, that "optimization" would just be adding an extra lookup.
So it's hard to say, really. The best way to get an answer might just be to have your favourite compiler spit out some optimized assembly and dig through it (unpleasant, but conclusive!).
Hope this helps!
It has been pointed out to me that GCC has an extension, called "bound member functions", that does indeed allow you to store the actual function pointer. Demo:
struct Foo
{
virtual ~Foo() { }
virtual int f(int, int) = 0;
};
void f(Foo & x)
{
using gcc_func_type = int (*)(Foo *, int, int);
gcc_func_type fp = (gcc_func_type)(x.*&Foo::f); // !
for ( /* ... */ )
{
int result = fp(&x, 10, 20); // no virtual dispatch!
}
}
The syntax requires that you go through a pointer-to-member indirection (i.e. you cannot just write (x.f)), and the cast must be a C-style cast. The resulting function pointer has the type of a pointer to a free function, with the instance argument taken as the first parameter. GCC warns about this conversion by default; the warning can be silenced with -Wno-pmf-conversions.
Here's the required template version, which resolves the call to f() at compile time instead of through virtual dispatch:
struct A { int f() const { return 0; } };
template<class T>
struct B { B(T &t) : t(t) { } int f() const { return t.f()+1; } T &t; };
template<class T>
int acc(const T *p, unsigned int N)
{
int result = 0;
for(unsigned int i = 0; i != N; ++i)
result += p->f();
return result;
}
And usage is:
int main() {
A a;
B<A> obj(a);
int result = acc(&obj, 10);
}