I use Intel PCM for fine-grained CPU measurements. In my code, I am trying to measure the cache efficiency.
Basically, I first put a small array into the L1 cache (by traversing it many times), then I start the timer, go over the array one more time (which hopefully hits the cache), and then stop the timer.
PCM reports rather high L2 and L3 miss ratios. I also checked with rdtscp, and the cost per array operation is 15 cycles (much higher than the 4-5 cycles of an L1 cache access).
I would expect the array to sit entirely in the L1 cache, so the L1, L2 and L3 miss ratios should all be low.
My system has 32K, 256K and 25M for L1, L2 and L3 respectively. Here's my code:
#include <cstdlib>
#include <iostream>
#include "cpucounters.h" // Intel PCM

using namespace std;

static const int ARRAY_SIZE = 16;

struct MyStruct {
    struct MyStruct *next;
    long int pad;
}; // each MyStruct is 16 bytes

int main() {
    PCM * m = PCM::getInstance();
    PCM::ErrorCode returnResult = m->program(PCM::DEFAULT_EVENTS, NULL);
    if (returnResult != PCM::Success){
        std::cerr << "Intel's PCM couldn't start" << std::endl;
        exit(1);
    }

    MyStruct *myS = new MyStruct[ARRAY_SIZE];

    // Make a sequential linked list
    for (int i=0; i < ARRAY_SIZE - 1; i++){
        myS[i].next = &myS[i + 1];
        myS[i].pad = (long int) i;
    }
    myS[ARRAY_SIZE - 1].next = NULL;
    myS[ARRAY_SIZE - 1].pad = (long int) (ARRAY_SIZE - 1);

    // Filling the cache
    MyStruct *current;
    for (int i = 0; i < 200000; i++){
        current = &myS[0];
        while ((current = current->next) != NULL)
            current->pad += 1;
    }

    // Sequential access experiment
    current = &myS[0];
    long sum = 0;
    SystemCounterState before = getSystemCounterState();
    while ((current = current->next) != NULL) {
        sum += current->pad;
    }
    SystemCounterState after = getSystemCounterState();

    cout << "Instructions per clock: " << getIPC(before, after) << endl;
    cout << "Cycles per op: " << getCycles(before, after) / ARRAY_SIZE << endl;
    cout << "L2 Misses: " << getL2CacheMisses(before, after) << endl;
    cout << "L2 Hits: " << getL2CacheHits(before, after) << endl;
    cout << "L2 hit ratio: " << getL2CacheHitRatio(before, after) << endl;
    cout << "L3 Misses: " << getL3CacheMisses(before, after) << endl;
    cout << "L3 Hits: " << getL3CacheHits(before, after) << endl;
    cout << "L3 hit ratio: " << getL3CacheHitRatio(before, after) << endl;
    cout << "Sum: " << sum << endl;

    m->cleanup();
    return 0;
}
This is the output:
Instructions per clock: 0.408456
Cycles per op: 553074
L2 Cache Misses: 58775
L2 Cache Hits: 11371
L2 cache hit ratio: 0.162105
L3 Cache Misses: 24164
L3 Cache Hits: 34611
L3 cache hit ratio: 0.588873
EDIT: I also checked the following code and still get the same miss ratios (I would have expected them to be almost zero):
SystemCounterState before = getSystemCounterState();
// this is just a comment
SystemCounterState after = getSystemCounterState();
EDIT 2: As one commenter suggested, these results might be due to the overhead of the profiler itself. So instead of traversing the array only once, I changed the code to traverse it many times (200,000,000 times) to amortize the profiler's overhead. I still get very low L2 and L3 cache hit ratios (15%).
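For reference, the amortized measurement described in EDIT 2 can be sketched roughly as below. This is only an illustration, not the code that produced the numbers above: the names `Node`, `link_sequential`, `traverse` and `ns_per_visit` are made up for the sketch, it uses `std::chrono::steady_clock` instead of PCM or rdtscp (so it reports nanoseconds, not cycles), and unlike the original loop it also counts the head node.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Same layout as MyStruct in the question: 16 bytes on a typical 64-bit ABI.
struct Node {
    Node *next;
    long pad;
};

// Link the nodes into a sequential list; the last node terminates it.
void link_sequential(std::vector<Node> &nodes) {
    for (std::size_t i = 0; i < nodes.size(); ++i) {
        nodes[i].next = (i + 1 < nodes.size()) ? &nodes[i + 1] : nullptr;
        nodes[i].pad = static_cast<long>(i);
    }
}

// Walk the list once and sum the pad of every node, head included.
long traverse(const Node *head) {
    long sum = 0;
    for (const Node *cur = head; cur != nullptr; cur = cur->next)
        sum += cur->pad;
    return sum;
}

// Average wall-clock nanoseconds per node visit over `passes` traversals.
// The running total goes into `sink` so the work has an observable result.
double ns_per_visit(const std::vector<Node> &nodes, long passes, long &sink) {
    assert(passes > 0 && !nodes.empty());
    auto start = std::chrono::steady_clock::now();
    for (long p = 0; p < passes; ++p)
        sink += traverse(nodes.data());
    auto stop = std::chrono::steady_clock::now();
    double ns = std::chrono::duration<double, std::nano>(stop - start).count();
    return ns / (static_cast<double>(passes) * static_cast<double>(nodes.size()));
}
```

Timing many passes in one interval spreads the fixed cost of starting and stopping the measurement over all visits. An optimizing compiler may still simplify the repeated read-only passes, so the result should be treated as a rough figure.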
It seems that you are getting L2 and L3 misses from all cores on your system.
I looked at the PCM implementation here: https://github.com/erikarn/intel-pcm/blob/ecc0cf608dfd9366f4d2d9fa48dc821af1c26f33/src/cpucounters.cpp
[1] In the implementation of PCM::program() (line 1407), I don't see any code that limits events to a specific process.
[2] In the implementation of PCM::getSystemCounterState() (line 2809), you can see that the events are gathered from all cores on your system.
So I would try setting the CPU affinity of the process to one core and then reading events only from that core, with this function:
CoreCounterState getCoreCounterState(uint32 core)
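A minimal sketch of the pinning step, assuming Linux with glibc (where g++ exposes `sched_setaffinity` and the `CPU_*` macros by default). The helper name `pin_to_core` is made up for illustration, and the PCM calls appear only in comments because I have not run them against this exact PCM version:

```cpp
#include <sched.h> // sched_setaffinity, sched_getcpu, cpu_set_t (Linux, glibc)

// Restrict the calling thread to a single core so that per-core counters
// read for that core cover the code under test. Returns true on success.
bool pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    // pid 0 means "the calling thread".
    return sched_setaffinity(0, sizeof(set), &set) == 0;
}

// Intended use with PCM (assumption, not compiled here):
//   pin_to_core(0);
//   CoreCounterState before = getCoreCounterState(0);
//   ... traverse the array ...
//   CoreCounterState after = getCoreCounterState(0);
//   double l2 = getL2CacheHitRatio(before, after);
```

Other processes can still be scheduled on the chosen core, so this narrows the measurement but does not isolate it completely.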