I'm compiling a C++ library which defines a single function that randomly samples from a set of data points. The data points are stored in a std::vector. There are 126,272 std::vector push_back statements, where the vector in question is of type double. It is taking a long time to compile.

Why would this take so long? (All the code other than the std::vector push_back statements would take less than 1 second to compile, because there's very little other code.)
This question was completely answered by the great answer from osgx.

Maybe one additional aspect: push_back() vs. initialization list. When running the above test with 100000 push_backs, I get the following result with gcc 4.4.6 on a Debian 6.0.6 system:

When using an initialization list, it is much faster:

This is about 30+ times faster!
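For reference, a minimal sketch of the two variants being compared (an assumed shape, not the original test file; the initializer-list form needs C++11, e.g. -std=c++0x on compilers of that era):

```cpp
#include <vector>

// Variant 1: one push_back statement per data point (the slow-to-compile
// pattern from the question).
std::vector<double> make_data_push_back()
{
    std::vector<double> v;
    v.push_back(0.5);
    v.push_back(1.5);
    v.push_back(2.5);
    // ... many thousands more push_back lines ...
    return v;
}

// Variant 2: a single braced initializer list (compiles much faster here).
std::vector<double> make_data_init_list()
{
    std::vector<double> v = { 0.5, 1.5, 2.5 /* , ... many more values ... */ };
    return v;
}
```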
There is the -ftime-report option in gcc which prints a detailed report of the time spent by each compiler phase. I used Ubuntu 12.04 64-bit with gcc 4.6.3 and code along the lines of the sketch below to reproduce your situation:
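A sketch of what such a test file looks like (an assumed reconstruction, not the original listing; the shell loop and compile command in the comment are just one way to generate and build it):

```cpp
// Assumed shape of the generated test case: one function containing N
// push_back statements with distinct double literals. The body can be
// produced mechanically, e.g.:
//   for i in $(seq 1 100000); do echo "    v.push_back($i.5);"; done
// and the file compiled with:
//   g++ -O0 -c -ftime-report test.cpp
#include <vector>

std::vector<double> f()
{
    std::vector<double> v;
    v.push_back(1.5);
    v.push_back(2.5);
    v.push_back(3.5);
    // ... repeated until there are N push_back statements ...
    return v;
}
```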
Here are the -ftime-report outputs for various N (the wall-clock time was inaccurate because of background load on the PC, so look at the user time, usr):

N=10000
N=100000
So, there is some extra work for huge N in the "expand vars" phase. This phase is exactly at this line: cfgexpand.c:4463 (between the push and pop of the TV_VAR_EXPAND timing macro).
The interesting fact: I get very short compilation times with my custom-compiled 32-bit g++ 4.6.2 (~20 sec for N = 100000).
What is the difference between my g++ and the Ubuntu g++? Ubuntu turns on GCC stack protection (the -fstack-protector option) by default, and this protection is added exactly in the "expand vars" phase (found in the sources at cfgexpand.c:1644, in expand_used_vars(); mentioned here):

N=100000, stack protector disabled with the -fno-stack-protector option (use it for your code):

Running time is 24 seconds, down from 800.
UPDATE:
After running gcc inside callgrind (the call-graph profiling tool from Valgrind), I can say that there are N stack variables. If the stack protector is enabled, they are handled in the "expand vars" phase by three O(N^2) algorithms. Actually there are N^2 successful conflict detections and 1.5 * N^2 bit manipulations done, plus some nested loop logic.

Why is the number of stack variables so high? Because every double constant in your code is saved to a different slot on the stack (see the sketch below). Then it is loaded from its slot and passed as the calling convention dictates (via the top of the stack on x86; via registers on x86_64). The funny fact: all of the push_back code compiled with -fstack-protector or with -fno-stack-protector is the same; the stack layout of the constants is the same too. Only some stack-layout offsets of non-push_back code are affected (checked by comparing two runs with -S and diff -u). No additional code is created by the enabled stack protector.

Enabling the stack protector fatally changes some behaviour inside the compiler. I can't say where exactly (note: it is possible to find this turning point by comparing stack traces against the callgraph.tar.gz by Juan M. Bello Rivas).
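To illustrate the point about the stack variables (an assumed mental model of the -O0 behaviour, not literal compiler output): each double literal effectively gets its own unnamed local, so the function carries on the order of N stack slots:

```cpp
#include <vector>

// Roughly as if each push_back(constant) had been written with its own
// named temporary, each occupying a separate slot in the frame:
std::vector<double> f_expanded_view()
{
    std::vector<double> v;
    double tmp0 = 0.5;  v.push_back(tmp0);   // slot for the 1st constant
    double tmp1 = 1.5;  v.push_back(tmp1);   // slot for the 2nd constant
    double tmp2 = 2.5;  v.push_back(tmp2);   // slot for the 3rd constant
    // ... one slot per push_back, N slots in total ...
    return v;
}
```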
The first big N*(N+1)/2 = O(N^2) walk is in the expand_used_vars_for_block (tree block, level) function, which records conflict information for every pair of stack variables; each add_stack_var_conflict(i, j) call turns into setting bits in the two variables' conflict bitmaps.
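A rough sketch of that pattern (an illustrative paraphrase, not the exact GCC source; the names mirror the functions mentioned above):

```cpp
#include <cstddef>
#include <vector>

// Each stack variable carries a conflict bitmap; add_stack_var_conflict(i, j)
// sets bit j in variable i's bitmap and bit i in variable j's bitmap.
struct StackVar {
    std::vector<bool> conflicts;   // stands in for GCC's conflict bitmap
};

static void add_stack_var_conflict(std::vector<StackVar>& vars,
                                   std::size_t i, std::size_t j)
{
    vars[i].conflicts.resize(vars.size());
    vars[j].conflicts.resize(vars.size());
    vars[i].conflicts[j] = true;
    vars[j].conflicts[i] = true;
}

// Every variable in the block is marked as conflicting with every other
// variable seen so far, giving roughly N*(N+1)/2 calls in total.
static void record_block_conflicts(std::vector<StackVar>& vars)
{
    for (std::size_t i = 0; i < vars.size(); ++i)
        for (std::size_t j = 0; j < i; ++j)
            add_stack_var_conflict(vars, i, j);
}
```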
There is a second N^2 walk in add_alias_set_conflicts. It does type checks for every pair with objects_must_conflict_p, which checks whether two variables are of the same type (most pairs are; this is type-based alias analysis, TBAA). If not, add_stack_var_conflict is called; there are only N such calls from this N^2 loop nest.

The last huge walk is in the partition_stack_vars() function, with a qsort of the stack vars (O(N log N)) and an N*(N-1)/2 = O(N^2) walk to find all non-conflicting pairs. The pseudocode of partition_stack_vars from the cfgexpand.c file is sketched below.
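A paraphrase of that partitioning algorithm (an illustrative sketch of the idea described in cfgexpand.c's comments, not the verbatim listing):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Sketch: sort variables by size, then for each variable scan the remaining
// ones for non-conflicting partners to pack into the same stack region. The
// scan over candidate pairs is what makes this O(N^2); stack_var_conflict_p
// is consulted for every candidate pair.
struct Var {
    std::size_t size;
    std::vector<bool> conflicts;   // stands in for the conflict bitmap
    long partition = -1;           // which stack region this var ends up in
};

static bool stack_var_conflict_p(const Var& a, std::size_t j)
{
    return j < a.conflicts.size() && a.conflicts[j];
}

static void partition_stack_vars_sketch(std::vector<Var>& vars)
{
    std::vector<std::size_t> order(vars.size());
    for (std::size_t i = 0; i < order.size(); ++i) order[i] = i;
    // The O(N log N) sort by decreasing size.
    std::sort(order.begin(), order.end(),
              [&](std::size_t a, std::size_t b) { return vars[a].size > vars[b].size; });

    long next_partition = 0;
    for (std::size_t a : order) {
        if (vars[a].partition != -1) continue;
        vars[a].partition = next_partition++;
        // One scan per variable => O(N^2) conflict checks overall. In the
        // reported case every check finds a conflict, so nothing is merged.
        for (std::size_t b : order) {
            if (vars[b].partition != -1) continue;
            if (!stack_var_conflict_p(vars[a], b))
                vars[b].partition = vars[a].partition;   // UNION(A, B)
        }
    }
}
```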
The function stack_var_conflict_p just checks whether the i-th variable has a conflict bitmask at all and whether its j-th bit is set as a conflict flag with the j-th variable (via a call to bitmap_bit_p(i->conflict_mask, j)). The really bad news here is that callgrind says every conflict check was successful, so the UNION logic is skipped for every pair.

So, a lot of time is wasted by O(N^2) bit sets and O(N^2/2) bit checks, and all this work doesn't help to optimize anything. And yes, this is all part of -O0 and triggered by -fstack-protector.

UPDATE2:
It seems the turning point is expand_one_var in cfgexpand.c from 4.6, in the check for immediate or deferred allocation of a variable on the stack; a paraphrase of this check is sketched below. (expand_one_stack_var was called here only in the fast variant, according to callgrind.)
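A paraphrase of that check (illustrative stubs only, not the GCC 4.6 source):

```cpp
// Sketch of the decision inside expand_one_var: either defer the variable,
// which queues it for the quadratic conflict/partition machinery described
// above, or allocate its stack slot immediately (the fast path that callgrind
// shows when the stack protector is off).
struct tree_node;   // stand-in for GCC's tree type

bool defer_stack_allocation(tree_node* var, bool toplevel);  // see next sketch
void add_stack_var(tree_node* var);           // deferred: joins the O(N^2) work
void expand_one_stack_var(tree_node* var);    // immediate allocation

void expand_one_var_sketch(tree_node* var, bool toplevel, bool really_expand)
{
    if (defer_stack_allocation(var, toplevel))
        add_stack_var(var);                   // slow path (-fstack-protector)
    else if (really_expand)
        expand_one_stack_var(var);            // fast path (-fno-stack-protector)
}
```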
Deferred allocation is forced when -fstack-protector is enabled (sometimes it needs to reorder all stack variables). There is even a comment about a "quadratic problem" which looks all too familiar to us now (stack allocation is also deferred at -O2 and greater); see the sketch below.

Here is the commit which added this logic: http://gcc.gnu.org/ml/gcc-patches/2005-05/txt00029.txt
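And a paraphrase of the deferral rule itself, including the gist of that "quadratic problem" comment (an assumed reconstruction, not the verbatim source):

```cpp
// Sketch of defer_stack_allocation: with the stack protector on, every stack
// variable is deferred so the frame can be reordered, which is exactly what
// feeds N variables into the quadratic phase. Without it, small scalars at
// -O0 are allocated immediately, precisely to keep the "quadratic problem"
// small when compilation should be as fast as possible.
bool defer_stack_allocation_sketch(bool flag_stack_protect, bool toplevel,
                                   int optimize, unsigned var_size_bytes)
{
    if (flag_stack_protect)
        return true;                        // defer everything -> O(N^2) work

    if (toplevel && optimize < 2)
        return false;                       // outermost-scope vars: allocate now

    if (optimize == 0 && var_size_bytes < 32)
        return false;                       // small scalars at -O0: allocate now

    return true;                            // defer (e.g. at -O2 and above)
}
```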
I believe the long time relates to vector being a template. The compiler needs to rewrite every occurrence of push_back with the corresponding function. It is like having many overloaded functions, where the compiler needs to do name mangling to address the correct function. This is extra work compared to simply compiling non-overloaded functions.