clflush not flushing the instruction cache

Posted 2019-03-16 21:57

Question:

Consider the following code segment:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#define ARRAYSIZE(arr) (sizeof(arr)/sizeof(arr[0]))


inline void
clflush(volatile void *p)
{
    asm volatile ("clflush (%0)" :: "r"(p));
}

inline uint64_t
rdtsc()
{
    unsigned long a, d;
    asm volatile ("cpuid; rdtsc" : "=a" (a), "=d" (d) : : "ebx", "ecx");
    return a | ((uint64_t)d << 32);
}

inline int func() { return 5;}

inline void test()
{
    uint64_t start, end;
    char c;
    start = rdtsc();
    func();
    end = rdtsc();
    printf("%ld ticks\n", end - start);
}

void flushFuncCache()
{
    // Assuming function to be not greater than 320 bytes.
    char* fPtr = (char*)func;
    clflush(fPtr);
    clflush(fPtr+64);
    clflush(fPtr+128);
    clflush(fPtr+192);
    clflush(fPtr+256);
}

int main(int ac, char **av)
{
    test();
    printf("Function must be cached by now!\n");
    test();
    flushFuncCache();
    printf("Function flushed from cache.\n");
    test();
    printf("Function must be cached again by now!\n");
    test();

    return 0;
}

Here, I am trying to flush the instruction cache to evict the code for 'func', expecting a performance penalty on the next call to func. But my results do not match that expectation:

858 ticks
Function must be cached by now!
788 ticks
Function flushed from cache.
728 ticks
Function must be cached again by now!
710 ticks

I was expecting CLFLUSH to also flush the instruction cache, but apparently it does not. Can someone explain this behavior or suggest how to achieve the desired behavior?

Answer 1:

Your code does almost nothing in func, and the little it does gets inlined into test and is probably optimized out, since you never use the return value.

gcc -O3 gives me:

0000000000400620 <test>:
  400620:       53                      push   %rbx
  400621:       0f a2                   cpuid
  400623:       0f 31                   rdtsc
  400625:       48 89 d7                mov    %rdx,%rdi
  400628:       48 89 c6                mov    %rax,%rsi
  40062b:       0f a2                   cpuid
  40062d:       0f 31                   rdtsc
  40062f:       5b                      pop    %rbx
  ...

So you're measuring the time for two moves, which are very cheap hardware-wise. Your measurement is probably showing the latency of cpuid, which is relatively expensive.
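One way to check how much of the reported tick count is just timing overhead is to time an empty region. The following is a minimal sketch, not the asker's code: the helper names timestamp and timing_overhead are hypothetical, and it swaps the original cpuid for lfence around rdtsc, on the assumption that lfence orders rdtsc well enough on the CPU in question (GCC or Clang on x86 only).

```c
#include <assert.h>
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc, _mm_lfence (GCC/Clang, x86 only) */

/* Hypothetical helper: read the TSC with lfence on both sides.
 * Assumption: lfence keeps rdtsc from being reordered with the
 * surrounding instructions; the original code used cpuid instead. */
static inline uint64_t timestamp(void)
{
    _mm_lfence();
    uint64_t t = __rdtsc();
    _mm_lfence();
    return t;
}

/* Smallest observed cost, in TSC ticks, of an empty timed region.
 * Taking the minimum over many runs filters out interrupts and
 * other one-off disturbances. */
static uint64_t timing_overhead(int runs)
{
    assert(runs > 0);
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < runs; i++) {
        uint64_t start = timestamp();
        uint64_t end = timestamp();
        if (end - start < best)
            best = end - start;
    }
    return best;
}
```

If timing_overhead(1000) comes out close to the numbers printed by test(), the measurement is dominated by the timing instructions rather than by the call to func.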

Worse, your clflush would actually flush test as well. That means you pay the re-fetch penalty the next time you access test, but that fetch happens outside the rdtsc pair, so it is not measured. The measured code, on the other hand, follows sequentially, so fetching test would probably also fetch the flushed code you measure, and it could already be cached again by the time you measure it.
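The points above suggest a few changes to the experiment: keep func as a real out-of-line call, use its return value, and flush only the cache line func starts on. The following is a rough sketch under those assumptions, not a definitive fix. The names flush_range, timed_call, timed_call_after_flush and sink are hypothetical, the 64-byte line size is assumed, and lfence is used around rdtsc in place of cpuid (GCC or Clang on x86 only).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <x86intrin.h>   /* _mm_clflush, _mm_mfence, _mm_lfence, __rdtsc */

#define CACHE_LINE 64    /* assumed cache line size */

/* noinline keeps a real call; the empty asm is a side effect that stops
 * the compiler from deleting the call; aligned() starts func on its own
 * line (a following function may still share the tail of that line). */
__attribute__((noinline, aligned(CACHE_LINE)))
int func(void)
{
    __asm__ __volatile__("");
    return 5;
}

static volatile int sink;   /* consuming the result keeps it observable */

/* Flush every cache line overlapping [addr, addr + len). */
static void flush_range(const void *addr, size_t len)
{
    assert(len > 0);
    const char *p = (const char *)((uintptr_t)addr & ~(uintptr_t)(CACHE_LINE - 1));
    const char *end = (const char *)addr + len;
    for (; p < end; p += CACHE_LINE)
        _mm_clflush(p);
    _mm_mfence();           /* wait for the flushes to complete */
}

/* Time one call to func in TSC ticks. */
static uint64_t timed_call(void)
{
    _mm_lfence();
    uint64_t start = __rdtsc();
    _mm_lfence();
    sink = func();
    _mm_lfence();
    uint64_t end = __rdtsc();
    _mm_lfence();
    return end - start;
}

/* Flush the line holding func, then time the next call. */
static uint64_t timed_call_after_flush(void)
{
    flush_range((const void *)(uintptr_t)func, CACHE_LINE);
    return timed_call();
}
```

Calling timed_call() a couple of times to warm the cache, then timed_call_after_flush(), and repeating the pair many times should make any re-fetch penalty easier to see than a single run. Whether the flushed call is measurably slower still depends on the CPU and on what else shares that cache line.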



Answer 2:

It works well on my computer:

264 ticks
Function must be cached by now!
258 ticks
Function flushed from cache.
519 ticks
Function must be cached again by now!
240 ticks