Question:
I have a C++ application, running on Linux, which I'm in the process of optimizing. How can I pinpoint which areas of my code are running slowly?
Answer 1:
If your goal is to use a profiler, use one of the suggested ones.
However, if you're in a hurry and you can manually interrupt your program under the debugger while it's being subjectively slow, there's a simple way to find performance problems.
Just halt it several times, and each time look at the call stack. If there is some code that is wasting some percentage of the time, 20% or 50% or whatever, that is the probability that you will catch it in the act on each sample. So that is roughly the percentage of samples on which you will see it. There is no educated guesswork required. If you do have a guess as to what the problem is, this will prove or disprove it.
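For example, on Linux one way to take such samples is under gdb (a rough sketch; myprog is a placeholder for your binary, ideally built with -g):
gdb ./myprog
(gdb) run
... wait until the program is being subjectively slow, then press Ctrl-C ...
(gdb) thread apply all bt
(gdb) continue
Repeat the interrupt/backtrace/continue cycle a handful of times, and pay attention to any call that shows up on more than one sample.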
You may have multiple performance problems of different sizes. If you clean out any one of them, the remaining ones will take a larger percentage, and be easier to spot, on subsequent passes. This magnification effect, when compounded over multiple problems, can lead to truly massive speedup factors.
Caveat: Programmers tend to be skeptical of this technique unless they've used it themselves. They will say that profilers give you this information, but that is only true if they sample the entire call stack, and then let you examine a random set of samples. (The summaries are where the insight is lost.) Call graphs don't give you the same information, because
- they don't summarize at the instruction level, and
- they give confusing summaries in the presence of recursion.
They will also say it only works on toy programs, when actually it works on any program, and it seems to work better on bigger programs, because they tend to have more problems to find. They will say it sometimes finds things that aren't problems, but that is only true if you see something once. If you see a problem on more than one sample, it is real.
P.S. This can also be done on multi-thread programs if there is a way to collect call-stack samples of the thread pool at a point in time, as there is in Java.
P.P.S. As a rough generality, the more layers of abstraction you have in your software, the more likely you are to find that they are the cause of performance problems (and the opportunity to get speedup).
Added: It might not be obvious, but the stack sampling technique works equally well in the presence of recursion. The reason is that the time that would be saved by removal of an instruction is approximated by the fraction of samples containing it, regardless of the number of times it may occur within a sample.
Another objection I often hear is: "It will stop someplace random, and it will miss the real problem". This comes from having a prior concept of what the real problem is. A key property of performance problems is that they defy expectations. Sampling tells you something is a problem, and your first reaction is disbelief. That is natural, but you can be sure if it finds a problem it is real, and vice-versa.
ADDED: Let me make a Bayesian explanation of how it works. Suppose there is some instruction I (call or otherwise) which is on the call stack some fraction f of the time (and thus costs that much). For simplicity, suppose we don't know what f is, but assume it is either 0.1, 0.2, 0.3, ... 0.9, 1.0, and the prior probability of each of these possibilities is 0.1, so all of these costs are equally likely a-priori.
Then suppose we take just 2 stack samples, and we see instruction I on both samples, designated observation o=2/2. This gives us new estimates of the frequency f of I, according to this:
Prior
P(f=x)  x    P(o=2/2|f=x)  P(o=2/2&&f=x)  P(o=2/2&&f>=x)  P(f>=x|o=2/2)
0.1     1    1             0.1            0.1             0.25974026
0.1     0.9  0.81          0.081          0.181           0.47012987
0.1     0.8  0.64          0.064          0.245           0.636363636
0.1     0.7  0.49          0.049          0.294           0.763636364
0.1     0.6  0.36          0.036          0.33            0.857142857
0.1     0.5  0.25          0.025          0.355           0.922077922
0.1     0.4  0.16          0.016          0.371           0.963636364
0.1     0.3  0.09          0.009          0.38            0.987012987
0.1     0.2  0.04          0.004          0.384           0.997402597
0.1     0.1  0.01          0.001          0.385           1

P(o=2/2) = 0.385
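(In case the arithmetic behind the table isn't obvious: each row is just Bayes' rule applied to one candidate value. For example, for x = 0.5 the joint probability is P(o=2/2|f=0.5) * P(f=0.5) = 0.5^2 * 0.1 = 0.025, and the last column divides the cumulative joint probability by P(o=2/2) = 0.385, giving 0.355/0.385 = 0.922 for f >= 0.5.)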
The last column says that, for example, the probability that f >= 0.5 is 92%, up from the prior assumption of 60%.
Suppose the prior assumptions are different. Suppose we assume P(f=0.1) is .991 (nearly certain), and all the other possibilities are almost impossible (0.001). In other words, our prior certainty is that I is cheap. Then we get:
Prior
P(f=x)  x    P(o=2/2|f=x)  P(o=2/2&&f=x)  P(o=2/2&&f>=x)  P(f>=x|o=2/2)
0.001   1    1             0.001          0.001           0.072727273
0.001   0.9  0.81          0.00081        0.00181         0.131636364
0.001   0.8  0.64          0.00064        0.00245         0.178181818
0.001   0.7  0.49          0.00049        0.00294         0.213818182
0.001   0.6  0.36          0.00036        0.0033          0.24
0.001   0.5  0.25          0.00025        0.00355         0.258181818
0.001   0.4  0.16          0.00016        0.00371         0.269818182
0.001   0.3  0.09          0.00009        0.0038          0.276363636
0.001   0.2  0.04          0.00004        0.00384         0.279272727
0.991   0.1  0.01          0.00991        0.01375         1

P(o=2/2) = 0.01375
Now it says P(f >= 0.5) is 26%, up from the prior assumption of 0.6%. So Bayes allows us to update our estimate of the probable cost of I. If the amount of data is small, it doesn't tell us accurately what the cost is, only that it is big enough to be worth fixing.
Yet another way to look at it is called the Rule Of Succession. If you flip a coin 2 times, and it comes up heads both times, what does that tell you about the probable weighting of the coin? The respected way to answer is to say that it's a Beta distribution, with average value (number of hits + 1) / (number of tries + 2) = (2+1)/(2+2) = 75%.
(The key is that we see I more than once. If we only see it once, that doesn't tell us much except that f > 0.)
So, even a very small number of samples can tell us a lot about the cost of instructions that it sees. (And it will see them with a frequency, on average, proportional to their cost. If n samples are taken, and f is the cost, then I will appear on nf +/- sqrt(nf(1-f)) samples. Example: n=10, f=0.3, that is 3 +/- 1.4 samples.)
ADDED, to give an intuitive feel for the difference between measuring and random stack sampling:
There are profilers now that sample the stack, even on wall-clock time, but what comes out is measurements (or hot path, or hot spot, from which a "bottleneck" can easily hide). What they don't show you (and they easily could) is the actual samples themselves. And if your goal is to find the bottleneck, the number of them you need to see is, on average, 2 divided by the fraction of time it takes.
So if it takes 30% of time, 2/.3 = 6.7 samples, on average, will show it, and the chance that 20 samples will show it is 99.2%.
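(For the curious, the 99.2% figure follows from the binomial distribution this argument is implicitly using: the chance that at least 2 of 20 samples land on a 30% bottleneck is 1 - 0.7^20 - 20*0.3*0.7^19, which is approximately 0.992.)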
Here is an off-the-cuff illustration of the difference between examining measurements and examining stack samples. The bottleneck could be one big blob, or numerous small ones; it makes no difference.
Measurement is horizontal; it tells you what fraction of time specific routines take. Sampling is vertical. If there is any way to avoid what the whole program is doing at that moment, and if you see it on a second sample, you've found the bottleneck. That's what makes the difference - seeing the whole reason for the time being spent, not just how much.
Answer 2:
You can use Valgrind with the following options
valgrind --tool=callgrind ./(Your binary)
It will generate a file called callgrind.out.x. You can then use the kcachegrind tool to read this file. It will give you a graphical analysis of things, with results like which lines cost how much.
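For example (the numeric suffix of the output file is the PID of the profiled run, so yours will differ; myprog stands in for your binary):
valgrind --tool=callgrind ./myprog
kcachegrind callgrind.out.12345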
Answer 3:
I assume you're using GCC. The standard solution would be to profile with gprof.
Be sure to add -pg to compilation before profiling:
cc -o myprog myprog.c utils.c -g -pg
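After the program exits normally it writes gmon.out in the working directory, which gprof then reads (using the binary name from the line above):
./myprog
gprof myprog gmon.out > analysis.txt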
I haven't tried it yet but I've heard good things about google-perftools. It is definitely worth a try.
Related question here.
A few other buzzwords if gprof does not do the job for you: Valgrind, Intel VTune, Sun DTrace.
Answer 4:
Newer kernels (e.g. the latest Ubuntu kernels) come with the new 'perf' tools (apt-get install linux-tools), AKA perf_events.
These come with classic sampling profilers (man-page) as well as the awesome timechart!
The important thing is that these tools can do system-wide profiling and not just per-process profiling - they can show the interaction between threads, processes and the kernel, and let you understand the scheduling and I/O dependencies between processes.
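A minimal sketch of the usual workflow (myprog is a placeholder for your binary; the package name varies by distribution):
perf record -g ./myprog
perf report
perf record -g samples the program with call-graph capture, and perf report gives an interactive, per-function breakdown of the samples.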
Answer 5:
I would use Valgrind and Callgrind as a base for my profiling tool suite. What is important to know is that Valgrind is basically a Virtual Machine:
(wikipedia) Valgrind is in essence a virtual machine using just-in-time (JIT) compilation techniques, including dynamic recompilation. Nothing from the original program ever gets run directly on the host processor. Instead, Valgrind first translates the program into a temporary, simpler form called Intermediate Representation (IR), which is a processor-neutral, SSA-based form. After the conversion, a tool (see below) is free to do whatever transformations it would like on the IR, before Valgrind translates the IR back into machine code and lets the host processor run it.
Callgrind is a profiler built upon that. The main benefit is that you don't have to run your application for hours to get a reliable result. Even a one-second run is sufficient to get rock-solid, reliable results, because Callgrind is a non-probing profiler.
Another tool built upon Valgrind is Massif. I use it to profile heap memory usage. It works great. What it does is give you snapshots of memory usage -- detailed information about WHAT holds WHAT percentage of memory, and WHO put it there. Such information is available at different points in time of the application run.
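For example (the numeric suffix is the PID of the run, so yours will differ; myprog stands in for your binary):
valgrind --tool=massif ./myprog
ms_print massif.out.12345
ms_print prints a text report of the heap snapshots, including a graph of usage over time.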
Answer 6:
The answer to run valgrind --tool=callgrind is not quite complete without some options. We usually do not want to profile 10 minutes of slow startup time under Valgrind; we want to profile our program when it is doing some task.
So this is what I recommend. Run program first:
valgrind --tool=callgrind --dump-instr=yes -v --instr-atstart=no ./binary > tmp
Now when it works and we want to start profiling we should run in another window:
callgrind_control -i on
This turns profiling on. To turn it off and stop the whole task we might use:
callgrind_control -k
Now we have some files named callgrind.out.* in current directory. To see profiling results use:
kcachegrind callgrind.out.*
In the kcachegrind window I recommend clicking on the "Self" column header; otherwise it shows "main()" as the most time-consuming task. "Self" shows how much time each function took by itself, not together with its dependents.
Answer 7:
This is a response to Nazgob's Gprof answer.
I've been using Gprof the last couple of days and have already found three significant limitations, one of which I've not seen documented anywhere else (yet):
- It doesn't work properly on multi-threaded code, unless you use a workaround.
- The call graph gets confused by function pointers. Example: I have a function called multithread() which enables me to multi-thread a specified function over a specified array (both passed as arguments). Gprof, however, views all calls to multithread() as equivalent for the purposes of computing time spent in children. Since some functions I pass to multithread() take much longer than others, my call graphs are mostly useless. (To those wondering if threading is the issue here: no, multithread() can optionally, and did in this case, run everything sequentially on the calling thread only.)
- It says here that "... the number-of-calls figures are derived by counting, not sampling. They are completely accurate...". Yet I find my call graph giving me 5345859132+784984078 as call stats to my most-called function, where the first number is supposed to be direct calls, and the second recursive calls (which are all from itself). Since this implied I had a bug, I put long (64-bit) counters into the code and did the same run again. My counts: 5345859132 direct, and 78094395406 self-recursive calls. There are a lot of digits there, so I'll point out that the recursive calls I measure are 78bn, versus 784m from Gprof: a factor of 100 different. Both runs were single-threaded and unoptimised code, one compiled with -g and the other with -pg.
This was GNU Gprof (GNU Binutils for Debian) 2.18.0.20080103 running under 64-bit Debian Lenny, if that helps anyone.
Answer 8:
Use Valgrind, callgrind and kcachegrind:
valgrind --tool=callgrind ./(Your binary)
generates callgrind.out.x. Read it using kcachegrind.
Use gprof (add -pg):
cc -o myprog myprog.c utils.c -g -pg
(not so good for multi-threads, function pointers)
Use google-perftools:
It uses time sampling; I/O and CPU bottlenecks are revealed.
Intel VTune is the best (free for educational purposes).
Others: AMD Codeanalyst (since replaced with AMD CodeXL), OProfile, 'perf' tools (apt-get install linux-tools)
Answer 9:
For single-threaded programs you can use igprof, The Ignominious Profiler: https://igprof.org/ .
It is a sampling profiler, along the lines of the... long... answer by Mike Dunlavey, which will gift wrap the results in a browsable call stack tree, annotated with the time or memory spent in each function, either cumulative or per-function.
Answer 10:
These are the two methods I use for speeding up my code:
For CPU bound applications:
- Use a profiler in DEBUG mode to identify questionable parts of your code
- Then switch to RELEASE mode and comment out the questionable sections of your code (stub it with nothing) until you see changes in performance.
For I/O bound applications:
- Use a profiler in RELEASE mode to identify questionable parts of your code.
N.B.
If you don't have a profiler, use the poor man's profiler. Hit pause while debugging your application. Most developer suites will break into assembly with commented line numbers. You're statistically likely to land in a region that is eating most of your CPU cycles.
For CPU, the reason for profiling in DEBUG mode is that if you tried profiling in RELEASE mode, the compiler would reduce math, vectorize loops, and inline functions, which tends to glob your code into an un-mappable mess when it's assembled. An un-mappable mess means your profiler will not be able to clearly identify what is taking so long, because the assembly may not correspond to the source code under optimization. If you need the performance (e.g. timing-sensitive) of RELEASE mode, disable debugger features as needed to keep usable performance.
For I/O-bound, the profiler can still identify I/O operations in RELEASE mode because I/O operations are either externally linked to a shared library (most of the time) or in the worst case, will result in a sys-call interrupt vector (which is also easily identifiable by the profiler).
Answer 11:
Also worth mentioning are
- HPCToolkit (http://hpctoolkit.org/) - Open-source, works for parallel programs and has a GUI with which to look at the results multiple ways
- Intel VTune (https://software.intel.com/en-us/vtune) - If you have Intel compilers this is very good
- TAU (http://www.cs.uoregon.edu/research/tau/home.php)
I have used HPCToolkit and VTune and they are very effective at finding the long pole in the tent and do not need your code to be recompiled (except that you have to use -g -O or RelWithDebInfo type build in CMake to get meaningful output). I have heard TAU is similar in capabilities.
Answer 12:
You can use the iprof library:
https://gitlab.com/Neurochrom/iprof
https://github.com/Neurochrom/iprof
It's cross-platform and lets you measure the performance of your application in real time as well. You can even couple it with a live graph. Full disclaimer: I am the author.
Answer 13:
At work we have a really nice tool that helps us monitor what we want in terms of scheduling. It has been useful numerous times.
It's in C++ and must be customized to your needs. Unfortunately I can't share code, just concepts.
You use a "large" volatile buffer containing timestamps and event IDs that you can dump post mortem or after stopping the logging system (and dump this into a file, for example).
You retrieve the so-called large buffer with all the data, and a small interface parses it and shows events with a name (up/down + value) like an oscilloscope does with colors (configured in a .hpp file).
You customize the amount of events generated to focus solely on what you desire. It helped us a lot for scheduling issues while consuming the amount of CPU we wanted based on the amount of logged events per second.
You need 3 files:
toolname.hpp // interface
toolname.cpp // code
tool_events_id.hpp // Events ID
The concept is to define events in tool_events_id.hpp like this:
// EVENT_NAME ID BEGIN_END BG_COLOR NAME
#define SOCK_PDU_RECV_D 0x0301 //@D00301 BGEEAAAA # TX_PDU_Recv
#define SOCK_PDU_RECV_F 0x0302 //@F00301 BGEEAAAA # TX_PDU_Recv
You also define a few functions in toolname.hpp:
#define LOG_LEVEL_ERROR 0
#define LOG_LEVEL_WARN 1
// ...
void init(void);
void probe(id,payload);
// etc
Wherever in your code, you can use:
toolname<LOG_LEVEL>::log(EVENT_NAME,VALUE);
The probe function uses a few assembly lines to retrieve the clock timestamp as soon as possible and then sets an entry in the buffer. We also have an atomic increment to safely find an index at which to store the log event. Of course, the buffer is circular.
Hope the idea is not obfuscated by the lack of sample code.
Answer 14:
You can use a logging framework like loguru, since it includes timestamps and total uptime, which can be used nicely for profiling.
Answer 15:
I'm actually a bit surprised that not many have mentioned google/benchmark. While it is a bit cumbersome to pin down the specific area of code, especially if the code base is a big one, I found it really helpful when used in combination with callgrind.
IMHO, identifying the piece that is causing the bottleneck is the key here. However, I'd try to answer the following questions first and choose a tool based on that:
- is my algorithm correct ?
- are there locks that are proving to be bottlenecks?
- is there a specific section of code that's proving to be a culprit ?
- how about I/O - is it handled and optimized?
valgrind in combination with callgrind and kcachegrind should provide a decent estimation on the points above, and once it's established that there are issues with some section of code, I'd suggest doing a micro-benchmark - google benchmark is a good place to start.
Answer 16:
Use the -pg flag when compiling and linking the code, then run the executable file (a.out below is the executable itself). While the program executes, profiling data is collected in the file gmon.out.
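A minimal sketch of the sequence, assuming the source file is named myprog.cpp:
g++ -pg -o a.out myprog.cpp
./a.out
On a normal exit, ./a.out writes gmon.out in the current directory; the gprof commands below read both files.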
There are two different types of profiling:
1- Flat profiling:
By running the command gprof --flat-profile a.out you get the following data:
- what percentage of the overall time was spent for the function,
- how many seconds were spent in a function—including and excluding calls to sub-functions,
- the number of calls,
- the average time per call.
2- Graph profiling:
Use the command gprof --graph a.out to get the following data for each function:
- In each section, one function is marked with an index number.
- Above the function, there is a list of functions that call it.
- Below the function, there is a list of functions that it calls.
For more info you can look at https://sourceware.org/binutils/docs-2.32/gprof/
Answer 17:
As no one mentioned Arm MAP, I'd add it, since I have personally used MAP successfully to profile a C++ scientific program.
Arm MAP is the profiler for parallel, multithreaded or single threaded C, C++, Fortran and F90 codes. It provides in-depth analysis and bottleneck pinpointing to the source line. Unlike most profilers, it's designed to be able to profile pthreads, OpenMP or MPI for parallel and threaded code.
MAP is commercial software.