Everyone always says to profile your program before performing optimizations but no-one ever describes how to do so.
What are your practices for profiling C code?
Using gcc, I compile and link with -pg (as explained e.g. here), then continue by running the program (according to the principles also suggested at that URL) and using gprof. The tools will vary if you're using different compilers &c, but the URL is still recommended, even then, for the parts that are about general ideas on how and why to profile your code.
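To make that workflow concrete, here is a minimal sketch of a program you could try it on. The file name demo.c, the DEMO_MAIN macro, and the function names are all made up for illustration; only the -pg flag, the gmon.out file, and the gprof command are the real tooling. The main function is guarded by the macro so the two functions can also be pulled into a test harness.

```c
/* demo.c: hypothetical toy program for trying out gprof.
 *
 * Build with profiling instrumentation, run, then read the report:
 *   gcc -pg -DDEMO_MAIN demo.c -o demo
 *   ./demo                   (writes gmon.out into the current directory)
 *   gprof ./demo gmon.out
 */
#include <stdio.h>

/* Deliberately slow primality test: trial division by every d < n. */
static int is_prime_slow(unsigned n) {
    if (n < 2) return 0;
    for (unsigned d = 2; d < n; d++) {
        if (n % d == 0) return 0;
    }
    return 1;
}

/* Counts the primes strictly below limit. */
unsigned count_primes(unsigned limit) {
    unsigned count = 0;
    for (unsigned n = 0; n < limit; n++) {
        count += (unsigned)is_prime_slow(n);
    }
    return count;
}

#ifdef DEMO_MAIN
/* Illustrative driver: enough work for the profiler to collect samples. */
int main(void) {
    printf("%u primes below 50000\n", count_primes(50000));
    return 0;
}
#endif
```

I would expect the flat profile to attribute nearly all of the time to is_prime_slow. Note that the build line above leaves optimization off; with -O2 the compiler may inline the static function, in which case it no longer shows up as a separate entry.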
If you are using Linux, then I recommend the combination of Valgrind, Callgrind and KCachegrind. Valgrind is a superb tool for finding memory leaks, and its Callgrind extension makes for a good profiler.
NOTE: I just learned that Valgrind now also works on Mac OS X. However, Callgrind and KCachegrind haven't been updated since 2005. You might want to look at other front-ends.
Glad you asked :-) If you don't mind a contrarian view, check these answers:
Let me try to put it in a nutshell:
Does the program wait for you, or do you wait for it? If it doesn't make you wait for it, then you don't have a problem, so leave it alone.
If it does make you wait, then proceed.
I recommend sampling, which means getting stroboscopic X-rays of what the program is doing when it's busy (not waiting for you). Get samples of at least the call stack, not just the program counter. If you only get samples of the program counter, they will be meaningless if your program spends significant time in I/O or in library routines, so don't settle for that.
If you want to get a lot of samples, you need a profiler. If you only need a few, the pause button in the debugger works just fine. In my experience, 20 is more than enough, and 5 is often sufficient.
Why? Suppose you have 1000 samples of the call stack. Each sample represents a sliver of wall-clock time that is being spent only because every single line of code on the stack requested it. So, if there is a line of code that appears on 557 samples out of 1000, you can assume it is responsible for 557/1000 of the time, give or take about 15 samples. That means, if the entire execution time was costing you $100, that line by itself is costing $55.70, give or take $1.50 **, so you should look to see if you really need it.
But do you need 1000 samples? If that line is costing about 55.7% of the time, then if you only took 10 samples, you would see it on 6 of them, give or take 1.5 samples. So if you do see a statement on 6 out of 10 samples, you know it is costing you roughly between $45 and $75 out of that $100. Even if it's only costing as little as $45, wouldn't you want to see if you really need it?
That's why you don't need a lot of samples - you don't need a lot of accuracy. What you do need is what the stack samples give you - they point you precisely at the most valuable lines to optimize.
** The standard deviation of the number of samples is sqrt( f * (1-f) * nsamp ), where f is the fraction of samples containing the line.
Shark / Instruments (using dtrace) are the profilers available on a Mac. They're pretty good.
For the sake of completeness, I would add OProfile. It is especially interesting if you want to benchmark the kernel.
Visual Studio Team System comes with a good profiler. Also, Intel VTune is not bad.