Profiling a (possibly I/O-bound) process to reduce

2019-03-15 14:54发布

问题:

I want to improve the performance of a specific method inside a larger application.

The goal is improving latency (wall-clock time spent in a specific function), not (neccessarily) system load.

Requirements:

  1. As I expect a lot of the latency to be due to I/O, take into account time spent waiting/blocked (in other words: look at wall clock time instead of CPU time)
  2. As the program does much more than the fragment i'm trying to optimize. There needs to be a way to either start/stop profiling programmatically, or to filter the output to only show the time between entering and exiting the function i'm optimizing.
  3. Profiling on the method level is acceptable (if it can be done on the instruction level, even better. if it only profiles system calls, that's probably not enough)
  4. This is for a hobby project, so expensive tools are not really an option
  5. Instrumentation (-finstrument-functions) is acceptable
  6. The critical piece of code I'm interested in is hard to interrupt manually (because it is already relatively fast and hard to realistically invoke in a loop), so some kind of automation is necessary.

Tools discarded so far:

  • gprof, oprofile, callgrind (requirement 1)
  • buiding something custom using getrusage (requirement 1)
  • poormansprofiler.org (requirement 2)
  • strace -T,dtrace,http://perf.wiki.kernel.org (requirements 2 and 3)
  • VTune,Zoom (requirement 4)
  • manual call-stack sampling (requirement 6)
  • google-perftools (should be able to measure wall time, but this does not appear to work in my case, presumably because SIGALRM interference.
  • systemtap (my kernel isn't patched to include utrace)

Other options which I haven't further evaluated yet:

  • cprof (doesn't build here out-of-the-box, seems i386-only)
  • manually inserting trace points (e.g. with lttng)

I'd love to hear about:

  • other options
  • perhaps I discarded some tool too soon?
  • whether or not the options I haven't evaluated yet have a chance of working, and if so, how to best do it.

I finally settled for:

  • building something custom using -finstrument-functions myself, based on http://balau82.wordpress.com/2010/10/06/trace-and-profile-function-calls-with-gcc/

The trace produced by this crude tool is hard to interpret, and I can easily imagine some tools for further processing its output making it infinitely more useful. However, this did the job for me for now, so I'm putting that project off until later ;).

回答1:

Use this method.

It is quite simple and effective at pinpointing opportunities for optimization, whether they are in CPU or IO bound code.

If you are right that the biggest opportunities are in a particular function or module, then it will find them. If they are elsewhere, it will find them.

Of the tools you mentioned and discarded, it is most similar to poor man's profiler, but still not very similar.

EDIT: Since you say it is triggered by a user interaction and blocks further input until it completes, here's how I would do it.

First, I assume it does not block a manual interrupt signal to the debugger, because otherwise you'd have no way to stop an infinite loop. Second, I would wrap a loop of 10, 100, or 1000 times around the routine in question, so it is doing it long enough to be manually interrupted.

Now, suppose it is spending some fraction of time doing I/O, like 50%. Then when you interrupt it, you have a 50% chance of catching it in the I/O. So if you catch it in the I/O, which the call stack will tell you, you can also see in great detail where the I/O is being requested from and why.

It will show you what's going on, which is almost certainly something surprising. If you see it doing something on as few as two (2) samples that you could find a way to eliminate, then you will get a considerable speedup. In fact, if you eliminate that activity, you don't know in advance how much time you will save, but on average you can expect to save fraction F = (s+1)/(n+2), where n is the total number of samples you took, and s is the number of samples that showed the activity. (Rule of Succession) Example, if you took 4 stack samples and saw the activity on 2 of them, on average it would save you F = 3/6 = 1/2, corresponding to a speedup factor of 1/(1-F) or 2.

Once you've done that, you can do it again and find something else to fix. The speedup factors multiply together like compound interest.

Then of course you remove the outer loop and "cash in" all the speedups you got.

If you are wondering how this differs from profiling, it is that by carefully examining each stack sample, and possibly related data, you can recognize activities that you could remove, where if all you've got is measurements, you are left trying to intuit what is going on. The actual amount of time you save is what it is, regardless of any measurements. The important thing is to find the problem. No matter how precisely a profiler might measure it, if you can't find it, you're not winning. These pages are full of people saying either they don't understand what their profiler is telling them, or it seems to be saying there is nothing to fix, which they are only too willing to accept. That's a case of rose-tinted glasses.

More on all that.



回答2:

For I/O bound applications you can use the --collect-systime=yes option of callgrind.

This collects time spent in system calls (in milliseconds). So if you believe you have an I/O bottleneck, you can use these stats to identify it.



回答3:

Todo: check out 'perf' (again)



回答4:

  • fork()
  • execxxx(process under test)
  • in the parent:
    • (in a loop) periodically call:
    • getrusage(RUSAGE_CHILDREN, ...)

getrusage() will not only gives you the cpu usage, but also the major/minor pagefaults, number of context switches,etc. The rest of the time is probably spent waiting for I/O. This won't give you detailed profiling information but a nice overall footprint of the program's behavior, comparable to running vmstat on a per-process basis.