Is clock_gettime() adequate for submicrosecond timing?

Published 2020-02-20 08:33

Question:

I need a high-resolution timer for the embedded profiler in the Linux build of our application. Our profiler measures scopes as small as individual functions, so it needs a timer precision of better than 25 nanoseconds.

Previously our implementation used inline assembly and the rdtsc instruction to query the high-frequency timer from the CPU directly, but this is problematic and requires frequent recalibration.

So I tried using the clock_gettime function instead to query CLOCK_PROCESS_CPUTIME_ID. The docs allege this gives me nanosecond timing, but I found that the overhead of a single call to clock_gettime() was over 250ns. That makes it impossible to time events 100ns long, and having such high overhead on the timer function seriously drags down app performance, distorting the profiles beyond value. (We have hundreds of thousands of profiling nodes per second.)

Is there a way to call clock_gettime() that has less than ¼μs overhead? Or is there some other way that I can reliably get the timestamp counter with <25ns overhead? Or am I stuck with using rdtsc?

Below is the code I used to time clock_gettime().

#include <cstdio>   // printf
#include <ctime>    // clock_getres, clock_gettime, timespec

// calls gettimeofday() to return wall-clock time in seconds:
extern double Get_FloatTime();
enum { TESTRUNS = 1024*1024*4 };

// time the high-frequency timer against the wall clock
{
    timespec spec;
    clock_getres( CLOCK_PROCESS_CPUTIME_ID, &spec );
    printf( "CLOCK_PROCESS_CPUTIME_ID resolution: %ld sec %ld nano\n",
            (long)spec.tv_sec, spec.tv_nsec );
    double fa = Get_FloatTime();  // start after the printf so setup isn't timed
    for ( int i = 0 ; i < TESTRUNS ; ++i )
    {
        clock_gettime( CLOCK_PROCESS_CPUTIME_ID, &spec );
    }
    double fb = Get_FloatTime();
    printf( "clock_gettime %d iterations : %.6f msec %.3f microsec / call\n",
        TESTRUNS, ( fb - fa ) * 1000.0, (( fb - fa ) * 1000000.0) / TESTRUNS );
}
// and so on for CLOCK_MONOTONIC, CLOCK_REALTIME, CLOCK_THREAD_CPUTIME_ID.
// and so on for CLOCK_MONOTONIC, CLOCK_REALTIME, CLOCK_THREAD_CPUTIME_ID.

Results:

CLOCK_PROCESS_CPUTIME_ID resolution: 0 sec 1 nano
clock_gettime 8388608 iterations : 3115.784947 msec 0.371 microsec / call
CLOCK_MONOTONIC resolution: 0 sec 1 nano
clock_gettime 8388608 iterations : 2505.122119 msec 0.299 microsec / call
CLOCK_REALTIME resolution: 0 sec 1 nano
clock_gettime 8388608 iterations : 2456.186031 msec 0.293 microsec / call
CLOCK_THREAD_CPUTIME_ID resolution: 0 sec 1 nano
clock_gettime 8388608 iterations : 2956.633930 msec 0.352 microsec / call

This is on a standard Ubuntu kernel. The app is a port of a Windows app (where our rdtsc inline assembly works just fine).

Addendum:

Does x86-64 GCC have some intrinsic equivalent to __rdtsc(), so I can at least avoid inline assembly?

Answer 1:

No. You'll have to use platform-specific code to do it. On x86 and x86-64, you can use 'rdtsc' to read the Time Stamp Counter.

Just port the rdtsc assembly you're using.

#include <stdint.h>

__inline__ uint64_t rdtsc(void) {
  uint32_t lo, hi;
  // cpuid is a serializing instruction: it keeps earlier work from
  // being reordered past the counter read below.
  __asm__ __volatile__ (
      "xorl %%eax,%%eax \n        cpuid"
      ::: "%rax", "%rbx", "%rcx", "%rdx");
  /* We cannot use "=A", since this would use %rax on x86_64 and
     return only the lower 32 bits of the TSC */
  __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
  return (uint64_t)hi << 32 | lo;
}
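
Regarding the addendum: yes, GCC (and Clang) expose the counter through an intrinsic, so the inline assembly can be avoided. A minimal sketch using __rdtsc() from <x86intrin.h>, with _mm_lfence() standing in for the cpuid serialization above (the ordering guarantees are slightly weaker, so verify this suits your measurement needs):

#include <stdint.h>
#include <x86intrin.h>  // __rdtsc, _mm_lfence (GCC/Clang, x86/x86-64)

// Fenced TSC read: lfence keeps earlier instructions from being
// reordered past the counter read. Not a full cpuid-style serialize.
static inline uint64_t rdtsc_intrinsic(void) {
    _mm_lfence();
    uint64_t t = __rdtsc();
    _mm_lfence();
    return t;
}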


Answer 2:

I ran some benchmarks on my system, a quad-core Xeon E5645 (which supports a constant TSC) running kernel 3.2.54, and the results were:

clock_gettime(CLOCK_MONOTONIC_RAW)       100ns/call
clock_gettime(CLOCK_MONOTONIC)           25ns/call
clock_gettime(CLOCK_REALTIME)            25ns/call
clock_gettime(CLOCK_PROCESS_CPUTIME_ID)  400ns/call
rdtsc (implementation @DavidSchwarz)     600ns/call

So it looks like, on a reasonably modern system, rdtsc (the accepted answer) is the worst route to go down.



Answer 3:

I need a high-resolution timer for the embedded profiler in the Linux build of our application. Our profiler measures scopes as small as individual functions, so it needs a timer precision of better than 25 nanoseconds.

Have you considered oprofile or perf? You can use the performance counter hardware on your CPU to get profiling data without adding instrumentation to the code itself. You can see data per-function, or even per-line-of-code. The "only" drawback is that it won't measure wall-clock time consumed; it will measure CPU time consumed, so it's not appropriate for all investigations.
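
For example, with perf (your_app is a placeholder for your binary; these are standard perf subcommands, though availability depends on your distro's linux-tools package):

perf stat ./your_app          # aggregate hardware counters, no instrumentation
perf record -g ./your_app     # sample the running program with call graphs
perf report                   # browse the samples per function (or per line)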



Answer 4:

Give the clockid_t CLOCK_MONOTONIC_RAW a try:

CLOCK_MONOTONIC_RAW (since Linux 2.6.28; Linux-specific) Similar to CLOCK_MONOTONIC, but provides access to a raw hardware-based time that is not subject to NTP adjustments or the incremental adjustments performed by adjtime(3).

From man7.org.
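
A minimal sketch of querying it (on glibc older than 2.17 you may need to link with -lrt):

#include <stdio.h>
#include <time.h>

int main(void) {
    struct timespec ts;
    if (clock_gettime(CLOCK_MONOTONIC_RAW, &ts) != 0) {
        perror("clock_gettime(CLOCK_MONOTONIC_RAW)");
        return 1;
    }
    printf("raw monotonic: %ld s %09ld ns\n", (long)ts.tv_sec, ts.tv_nsec);
    return 0;
}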



Answer 5:

Yes, most modern platforms will have a suitable clock_gettime call that is implemented purely in user-space using the VDSO mechanism, and will reliably take something like 20 to 30 nanoseconds to complete.

Internally, this is using rdtsc or rdtscp for the fine-grained portion of the time-keeping, plus adjustments to keep this in sync with wall-clock time (depending on the clock you choose) and a multiplication to convert from whatever units rdtsc has on your platform to nanoseconds.

Not all of the clocks offered by clock_gettime will implement this fast method, and it's not always obvious which ones do. Usually CLOCK_MONOTONIC is a good option, but you should test this on your own system.
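
As a sketch, a profiler might wrap the fast path like this, assuming CLOCK_MONOTONIC hits the vDSO on your system (the helper name now_ns is mine):

#include <stdint.h>
#include <time.h>

// Single 64-bit nanosecond timestamp; roughly 20-30 ns per call when
// the clock is serviced from the vDSO rather than via a real syscall.
static inline uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}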



Answer 6:

You are calling clock_gettime with a control parameter, which means the API branches through an if-else tree to see what kind of time you want. I know you can't avoid that with this call, but see if you can dig into the system code and call what the kernel eventually calls directly. Also, note that you are including the loop overhead (the i++ and the conditional branch) in your measurement.
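
A sketch of that correction: time an empty loop of the same trip count first and subtract it, so the per-call figure excludes the loop's own increment and branch (CLOCK_MONOTONIC is assumed for the outer wall-clock measurement):

#include <stdio.h>
#include <time.h>

enum { TESTRUNS = 1024 * 1024 * 4 };

static double elapsed_ns(struct timespec a, struct timespec b) {
    return (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
}

int main(void) {
    struct timespec spec, a, b;

    // Baseline: loop overhead only. The volatile sink keeps the
    // compiler from deleting the otherwise-empty loop.
    volatile unsigned sink = 0;
    clock_gettime(CLOCK_MONOTONIC, &a);
    for (int i = 0; i < TESTRUNS; ++i) sink += (unsigned)i;
    clock_gettime(CLOCK_MONOTONIC, &b);
    double loop = elapsed_ns(a, b) / TESTRUNS;

    // Loop plus the call under test.
    clock_gettime(CLOCK_MONOTONIC, &a);
    for (int i = 0; i < TESTRUNS; ++i)
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &spec);
    clock_gettime(CLOCK_MONOTONIC, &b);
    double total = elapsed_ns(a, b) / TESTRUNS;

    printf("per-call cost: %.1f ns (loop baseline %.1f ns)\n",
           total - loop, loop);
    return 0;
}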