When I run gprof on my C program it says no time accumulated for my program and shows 0 time for all function calls. However it does count the function calls.
How do I modify my program so that gprof will be able to count how much time something takes to run?
Did you specify -pg when compiling?
http://sourceware.org/binutils/docs-2.20/gprof/Compiling.html#Compiling
Once it is compiled, you run the program and then run gprof on the binary.
E.g.:
test.c:
#include <stdio.h>
int main ()
{
int i;
for (i = 0; i < 10000; i++) {
printf ("%d\n", i);
}
return 0;
}
Compile as cc -pg test.c
, then run as a.out
, then gprof a.out
, gives me
granularity: each sample hit covers 4 byte(s) for 1.47% of 0.03 seconds
% cumulative self self total
time seconds seconds calls ms/call ms/call name
45.6 0.02 0.02 10000 0.00 0.00 __sys_write [10]
45.6 0.03 0.02 0 100.00% .mcount (26)
2.9 0.03 0.00 20000 0.00 0.00 __sfvwrite [6]
1.5 0.03 0.00 20000 0.00 0.00 memchr [11]
1.5 0.03 0.00 10000 0.00 0.00 __ultoa [12]
1.5 0.03 0.00 10000 0.00 0.00 _swrite [9]
1.5 0.03 0.00 10000 0.00 0.00 vfprintf [2]
What are you getting?
I tried running Kinopiko's example, except I increased the number of iterations by a factor of 100.
test.c:
#include <stdio.h>
int main ()
{
int i;
for (i = 0; i < 1000000; i++) {
printf ("%d\n", i);
}
return 0;
}
Then I took 10 stackshots (under VC, but you can use pstack). Here are the stackshots:
9 copies of this stack:
NTDLL! 7c90e514()
KERNEL32! 7c81cbfe()
KERNEL32! 7c81cc75()
KERNEL32! 7c81cc89()
_write() line 168 + 57 bytes
_flush() line 162 + 23 bytes
_ftbuf() line 171 + 9 bytes
printf() line 62 + 14 bytes
main() line 7 + 14 bytes
mainCRTStartup() line 206 + 25 bytes
KERNEL32! 7c817077()
1 copy of this stack:
KERNEL32! 7c81cb96()
KERNEL32! 7c81cc75()
KERNEL32! 7c81cc89()
_write() line 168 + 57 bytes
_flush() line 162 + 23 bytes
_ftbuf() line 171 + 9 bytes
printf() line 62 + 14 bytes
main() line 7 + 14 bytes
mainCRTStartup() line 206 + 25 bytes
KERNEL32! 7c817077()
In case it isn't obvious, this tells you that:
mainCRTStartup() line 206 + 25 bytes Cost ~100% of the time
main() line 7 + 14 bytes Cost ~100% of the time
printf() line 62 + 14 bytes Cost ~100% of the time
_ftbuf() line 171 + 9 bytes Cost ~100% of the time
_flush() line 162 + 23 bytes Cost ~100% of the time
_write() line 168 + 57 bytes Cost ~100% of the time
In a nutshell, the program spends ~100% of it's time flushing to disk (or console) the output buffer as part of the printf on line 7.
(What I mean by "Cost of a line" is - it is the fraction of total time spent at the request of that line, and that's roughly the fraction of samples that contain it.
If that line could be made to take no time, such as by removing it, skipping over it, or passing its work off to an infinitely fast coprocessor, that time fraction is how much the total time would shrink. So if the execution of any of these lines of code could be avoided, time would shrink by somewhere in the range of 95% to 100%. If you were to ask "What about recursion?", the answer is It Makes No Difference.)
Now, maybe you want to know something else, like how much time is spent in the loop, for example. To find that out, remove the printf because it's hogging all the time. Maybe you want to know what % of time is spent purely in CPU time, not in system calls. To get that, just throw away any stackshots that don't end in your code.
The point I'm trying to make is if you're looking for things you can fix to make the code run faster, the data gprof gives you, even if you understand it, is almost useless. By comparison, if there is some of your code that is causing more wall-clock time to be spent than you would like, stackshots will pinpoint it.
One gotcha with gprof
: it doesn't work with code in dynamically-linked libraries. For that, you need to use sprof
. See this answer: gprof : How to generate call graph for functions in shared library that is linked to main program
First compile you application with -g
, and check what CPU counters are you using.
If your application runs very quick than gprof could just miss all events or less that required (reduce the number of events to read).
Actually profiling should show you CPU_CLK_UNHALTED
or INST_RETIRED
events without any special switches. But with such data you'll be able only to say how well your code it performing: INST_RETIRED/CPU_CLK_UNHALTED.
Try to use Intel VTune profiler - it's free for 30 days and for education.