Linux C++: how to profile time wasted due to cache


Question:

I know that I can use gprof to benchmark my code.

However, I have this problem -- I have a smart pointer that has an extra level of indirection (think of it as a proxy object).

As a result, I have this extra layer that affects pretty much all functions, and screws with caching.

Is there a way to measure the time my CPU wastes due to cache misses?

Thanks!

Answer 1:

You could try cachegrind and its front-end kcachegrind.
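For reference, a typical Cachegrind session looks something like this (the output file name includes the process ID, so the exact name will vary):

valgrind --tool=cachegrind ./yourExecutable
cg_annotate cachegrind.out.<pid>

kcachegrind can open the same cachegrind.out file for graphical browsing.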



Answer 2:

You could find a tool that accesses the CPU performance counters. There is probably a register in each core that counts L1, L2, etc. misses. Alternatively, Cachegrind simulates the cache hierarchy instruction by instruction.

However, I don't think that would be insightful. Your proxy objects are presumably modified by their own methods. A conventional profiler will tell you how much time those methods are taking. No profiling tool will tell you how much performance would improve if that source of cache pollution were removed. That's a matter of reducing the size and improving the locality of the program's working set, which isn't easy to extrapolate.

A quick Google search turned up boost::intrusive_ptr which might interest you. It doesn't appear to support something like weak_ptr, but converting your program might be trivial, and then you would know for sure the cost of the non-intrusive ref counts.
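For a feel of what that conversion looks like, here is a minimal sketch (the Obj type and its refcount field are hypothetical, and the count is not thread-safe, for brevity). boost::intrusive_ptr finds the two free functions below by argument-dependent lookup, and the count lives inside the object itself, so a dereference touches one allocation instead of two:

#include <boost/intrusive_ptr.hpp>

struct Obj {
    int refcount = 0;   // the ref count is embedded in the object itself
    // ... payload ...
};

// boost::intrusive_ptr calls these via argument-dependent lookup:
inline void intrusive_ptr_add_ref(Obj* p) { ++p->refcount; }
inline void intrusive_ptr_release(Obj* p) {
    if (--p->refcount == 0) delete p;
}

int main() {
    boost::intrusive_ptr<Obj> p(new Obj);   // count becomes 1
    boost::intrusive_ptr<Obj> q = p;        // count becomes 2
}   // both owners release; the last one deletes the Obj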



Answer 3:

Linux has supported this with perf since kernel 2.6.31. It lets you do the following (the commands are collected in the sketch after this list):

  • compile your code with -g to have debug information included
  • run your code, sampling e.g. the last-level cache miss counters: perf record -e LLC-loads,LLC-load-misses yourExecutable
  • run perf report
    • after acknowledging the initial message, select the LLC-load-misses line,
    • then e.g. the first function, and
    • then annotate. You should see the source lines (interleaved with the generated assembly) annotated with the fraction of last-level cache misses attributed to each line.
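Put together, a session might look like this (yourExecutable stands in for your program; in the perf report interface, pressing 'a' on a selected symbol opens the annotated view):

g++ -g -O2 -o yourExecutable main.cpp
perf record -e LLC-loads,LLC-load-misses ./yourExecutable
perf report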


Answer 4:

Continuing along the lines of @Mike_Dunlavey's answer:

First, obtain a time based profile, using your favorite tool: VTune or PTU or OProf.

Then, obtain a cache miss profile. L1 cache misses, or L2 cache misses, or ...

I.e. the first profile associates a "time spent" with each program counter. The second associates a "number of cache misses" value with each program counter.

Note: I often "reduce" the data, summing it up by function, or (if I have the technology) by loop, or by bins of, say, 64 bytes. Comparing individual program counters is often not useful, because the performance counters are fuzzy: the place where a cache miss gets reported is often several instructions away from where it actually happened.
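To make the "bins of 64 bytes" idea concrete, here is a minimal sketch; the map-based profile format is an assumption for illustration:

#include <cstdint>
#include <map>

// A profile: program counter -> sample count (time or cache misses).
using Profile = std::map<std::uint64_t, std::uint64_t>;

// Collapse PC-granularity samples into 64-byte bins, so two profiles
// can be compared despite the counters' "skid".
Profile reduce_to_64_byte_bins(const Profile& by_pc) {
    Profile by_bin;
    for (const auto& [pc, count] : by_pc)
        by_bin[pc & ~std::uint64_t{63}] += count;  // clear the low 6 bits
    return by_bin;
}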

OK, so now graph these two profiles to compare them. Here are some graphs that I find useful:

"Iceberg" charts: X axis is PC, positive Y axis is time, negative Y access is cache misses. Look for places that go both up and down.

("Interleaved" charts are also useful: same idea, X axis is PC, plot both time and cache misseson Y axis, but with narrow vertical lines of different colors, typically red and blue. Places where is a lot of both time and cache misses spent will have finely interleaved red and blue lines, almost looking purple. This extends to L2 and L3 cache misses, all on the same graph. By the way, you probably want to "normalize" the numbers, either to %age of total time or cache misses, or, even better, %age of the maximum data point of time or cache misses. If you get the scale wrong, you won't see anything.)

XY charts: for each sampling bin (PC, or function, or loop, or...) plot a point whose X coordinate is the normalized time, and whose Y coordinate is the normalized cache misses. If you get a lot of data points in the upper right hand corner - large %age time AND large %age cache misses - that is interesting evidence. Or, forget the number of points - if the sum of all the percentages in that upper corner is big...

Note, unfortunately, that you often have to roll these analyses yourself. Last I checked VTune does not do it for you. I have used gnuplot and Excel. (Warning: Excel dies above 64 thousand data points.)


More advice:

If your smart pointer is inlined, you may get the counts all over the place. In an ideal world you would be able to trace back PCs to the original line of source code. In this case, you may want to defer the reduction a bit: look at all individual PCs; map them back to lines of source code; and then map those into the original function. Many compilers, e.g. GCC, have symbol table options that allow you to do this.
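On Linux, one way to do that mapping (assuming the binary was built with -g) is addr2line; -f prints function names, -C demangles, and -i also prints the chain of inlined callers. The address here is just a made-up example:

addr2line -e yourExecutable -f -C -i 0x400f2a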

By the way, I suspect that your problem is NOT with the smart pointer causing cache thrashing. Unless you are doing smart_ptr<int> all over the place. If you are doing smart_ptr<Obj>, and sizeof(Obj) is greater than, say, 4*sizeof(Obj*) (and if the smart_ptr itself is not huge), then the overhead is not that much.

More likely it is the extra level of indirection that the smart pointer introduces that is causing your problem.

Coincidentally, I was talking to a guy at lunch who had a reference counted smart pointer that was using a handle, i.e. a level of indirection, something like

template<typename T> class refcntptr;

template<typename T> class refcnt_handle {
    T* ptr;
    int count;
    friend class refcntptr<T>;  // the smart pointer manages ptr and count
};

template<typename T> class refcntptr {
    refcnt_handle<T>* handle;   // the extra level of indirection
public:
    refcntptr(T* obj) {
        this->handle = new refcnt_handle<T>();
        this->handle->ptr = obj;
        this->handle->count = 1;
    }
};

(I wouldn't code it this way, but it serves for exposition.)

The double indirection this->handle->ptr can be a big performance problem. Or even a triple indirection, this->handle->ptr->field. At the least, on a machine with 5 cycle L1 cache hits, each this->handle->ptr->field would take 10 cycles. And be much harder to overlap than a single pointer chase. But, worse, if each is an L1 cache miss, even if it were only 20 cycles to the L2... well, it is much harder to hide 2*20=40 cycles of cache miss latency, than a single L1 miss.

In general, it is good advice to avoid levels of indirection in smart pointers. Instead of every smart pointer pointing to a shared handle, which in turn points to the object, you might make the smart pointer bigger by having it point to the object as well as to the handle. (The latter is then no longer what is commonly called a handle, but is more like an info object.)

E.g.

template<typename T> class refcntptr;

template<typename T> class refcnt_info {
    T* ptr;     // perhaps not necessary, but useful
    int count;
    friend class refcntptr<T>;
};

template<typename T> class refcntptr {
    refcnt_info<T>* info;
    T* ptr;     // direct pointer: dereferencing skips the info object
public:
    refcntptr(T* obj) {
        this->ptr = obj;
        this->info = new refcnt_info<T>();
        this->info->ptr = obj;
        this->info->count = 1;
    }
};

Anyway - a time profile is your best friend.


Oh, yeah - Intel EMON hardware can also tell you how many cycles you waited at a PC. That can distinguish a large number of L1 misses from a small number of L2 misses.



Answer 5:

It depends on what OS and CPU you are using. E.g. for Mac OS X and x86 or ppc, Shark will do cache miss profiling. Ditto for Zoom on Linux.



Answer 6:

If you're running an AMD processor, you can get CodeAnalyst, apparently free as in beer.



Answer 7:

My advice would be to use PTU (Performance Tuning Utility) from Intel.

This utility is the direct descendant of VTune and provides the best sampling profiler available. You'll be able to track where the CPU is spending or wasting time (with the help of the available hardware events), with no slowdown of your application or perturbation of the profile. And of course you'll be able to gather all the cache-line miss events you are looking for.



Answer 8:

Another tool for CPU performance counter-based profiling is oprofile. You can view its results using kcachegrind.
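With recent OProfile versions the session looks something like this (older versions used the opcontrol daemon instead of operf):

operf ./yourExecutable
opreport --symbols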



Answer 9:

Here's kind of a general answer.

For example, if your program is spending, say, 50% of its time in cache misses, then 50% of the time when you pause it, the program counter will be at one of the exact locations where it is waiting for the memory fetches that are causing the cache misses.
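A low-tech way to try this on Linux is to interrupt the program under gdb a handful of times and see where the program counter keeps landing:

gdb ./yourExecutable
(gdb) run
^C                # interrupt at a random point
(gdb) bt          # note where it stopped
(gdb) continue    # repeat several times; the hot spots will recur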