This is something of a bloated question, so I apologize ahead of time. I'm curious about how gprof works at a low technical level. I understand it uses timers, but then why does the executable need to be specially compiled to be profiled? Does the compilation cause space to be allocated for statistics?
Also, how is the timing done exactly?
Well, this gives a good explanation. Also, this explains statistical profiling.
Essentially, compiling your program for gprof (with the -pg flag) changes the executable (this is called instrumenting the code) to store some book-keeping information, e.g. how many times each function is called.
The statistical profiling bit comes from snooping the program counter regularly to get a sample of what your code is doing.
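As a toy illustration of that book-keeping, here is roughly what a per-function entry counter amounts to once the instrumentation is in place. The counter name foo_calls is invented here; the real instrumentation calls into a profiling runtime rather than bumping a plain global:

    /* Toy illustration only: a call counter bumped at function entry.
       "foo_calls" is a made-up name; the real inserted hook records
       its data inside the profiling runtime instead. */
    static unsigned long foo_calls;

    void foo(void)
    {
        foo_calls++;       /* conceptually added by the instrumentation */
        /* ... original body of foo ... */
    }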
Gprof does both. It instruments the code and collects samples by looking at the program counter.
Having just read the paper (again), let me try to explain it.
Suppose it takes samples at 100 Hz, but not while the process is blocked for I/O or some other reason. Each sample records the PC, which is in some function; the count of samples landing in that function is incremented.
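The sampling itself is not done by gprof the tool; it is done by the profiling runtime while your program runs (profil(2)/setitimer on most Unixes). Here is a hedged, Linux/x86-64-only sketch of the idea: a profiling timer fires at a fixed rate, the SIGPROF handler reads the sampled PC, and a histogram bin covering that address is incremented. The 100 Hz rate, NBINS, and start_sampling are all invented for illustration, and REG_RIP is platform-specific:

    /* Minimal sketch of PC sampling, in the spirit of what the profiling
       runtime does.  Linux/x86-64 specific; names and sizes are assumptions
       made purely for illustration. */
    #define _GNU_SOURCE
    #include <signal.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/time.h>
    #include <ucontext.h>

    #define NBINS 4096
    static unsigned long hist[NBINS];      /* one counter per chunk of text */
    static uintptr_t text_low, text_high;  /* address range being profiled  */

    static void on_sigprof(int sig, siginfo_t *si, void *uc_)
    {
        (void)sig; (void)si;
        ucontext_t *uc = uc_;
        /* The program counter at the moment the timer fired. */
        uintptr_t pc = (uintptr_t)uc->uc_mcontext.gregs[REG_RIP];
        if (pc >= text_low && pc < text_high)
            hist[(pc - text_low) * NBINS / (text_high - text_low)]++;
    }

    void start_sampling(uintptr_t low, uintptr_t high)
    {
        text_low = low; text_high = high;

        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_sigaction = on_sigprof;
        sa.sa_flags = SA_SIGINFO | SA_RESTART;
        sigaction(SIGPROF, &sa, NULL);

        /* ITIMER_PROF only ticks while the process is using CPU, which is
           why time spent blocked on I/O never shows up in the samples. */
        struct itimerval it = { {0, 10000}, {0, 10000} };   /* ~100 Hz */
        setitimer(ITIMER_PROF, &it, NULL);
    }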
So, at the end of a run, if there were (say) 1000 samples, the total execution time (CPU only) was 10 seconds. If routine B accounted for, say, 500 of those samples, its total execution time is half of that, or 5 seconds. This is its self time, because the PC was inside B for those samples. That does not tell you how long B takes to execute on average; for that you need to know how many times it was called. Nor does self time include time spent in callees.
When the code is compiled with the -pg flag, a special call is inserted in each routine's entry code. That call notices that routine B has been entered, and that it was called from a particular call site in routine A. There is a table, indexed by call site, in which that call is counted. So at the end, gprof can tell how many times B was called in total, and how many of those calls came from A.
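A rough sketch of what that entry hook records, under the assumption of a flat table of (call site, callee) pairs. The real hook is the compiler-inserted mcount/__fentry__ call and gprof's actual data structures differ; struct arc and record_arc are names invented here:

    /* Hedged sketch of the caller -> callee counting done at function entry. */
    #include <stdint.h>

    struct arc {
        uintptr_t call_site;   /* address of the call instruction in the caller (A) */
        uintptr_t callee;      /* entry address of the called routine (B)           */
        unsigned long count;   /* how many times this particular arc was taken      */
    };

    #define MAX_ARCS 1024
    static struct arc arcs[MAX_ARCS];
    static int narcs;

    /* Called (conceptually) on every function entry, with the call site taken
       from the caller's return address and the callee being the current function. */
    void record_arc(uintptr_t call_site, uintptr_t callee)
    {
        for (int i = 0; i < narcs; i++) {
            if (arcs[i].call_site == call_site && arcs[i].callee == callee) {
                arcs[i].count++;            /* seen this A -> B call site before */
                return;
            }
        }
        if (narcs < MAX_ARCS)
            arcs[narcs++] = (struct arc){ call_site, callee, 1 };
    }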
To get the average self time of B, its total self time is divided by the number of times it was called.
To get the total cumulative time (self plus callees) of routine A, gprof needs the self time of A, plus, for each subordinate routine B, the number of times A calls B times the average cumulative time of B. That total is then divided by the number of calls to A to get the average cumulative time of A.
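Putting the two pieces together, the post-processing arithmetic (ignoring recursion and call-graph cycles) looks roughly like this; all of the types and names below are invented for illustration, not gprof's real ones:

    /* Toy version of the averaging gprof does from the sample histogram
       and the call-count table. */
    struct routine;

    struct call_arc {
        struct routine *callee;
        unsigned long   count;        /* calls from this routine to callee */
    };

    struct routine {
        double           self_time;   /* seconds attributed from PC samples   */
        unsigned long    calls;       /* total times this routine was entered */
        struct call_arc *callees;
        int              ncallees;
    };

    /* Average time per call spent inside the routine itself. */
    double avg_self_time(const struct routine *r)
    {
        return r->calls ? r->self_time / r->calls : 0.0;
    }

    /* Average time per call including callees; assumes the call graph is
       acyclic, which is exactly where recursion makes things messy. */
    double avg_cumulative_time(const struct routine *r)
    {
        double total = r->self_time;
        for (int i = 0; i < r->ncallees; i++)
            total += r->callees[i].count * avg_cumulative_time(r->callees[i].callee);
        return r->calls ? total / r->calls : 0.0;
    }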
Sounds good, until recursion enters the picture, where it gets even more confusing.
It's all very clever, and as the authors point out, full of caveats.