ARM performance counters vs Linux clock_gettime

Posted 2019-04-01 08:23

Question:

I am using a Zynq chip on a development board (ZC702), which has a dual-core Cortex-A9 MPCore at 667 MHz and comes with Linux kernel 3.3. I wanted to measure the execution time of a program, so I first used clock_gettime and then used the counters provided by the ARM coprocessor. The counter increments once every processor cycle (based on this Stack Overflow question and this one).

I compile the program with the -O0 flag (since I don't want any reordering or optimization).

The time I measure with the performance counters is 583833498 cycles / 666.666687 MHz = 875750.221 microseconds.
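
The conversion above can be sketched as a pair of small helpers. This is only an illustration: the names cycles_elapsed and cycles_to_us are hypothetical and not part of the original program, and the sketch assumes the core clock stays constant (no frequency scaling) and that the 32-bit cycle counter wraps at most once between the two reads.

```c
#include <stdint.h>

/* Hypothetical helper: cycles elapsed between two reads of the 32-bit
 * cycle counter. Unsigned subtraction stays correct across one wrap-around. */
static uint32_t cycles_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Hypothetical helper: convert a cycle count to microseconds.
 * cpu_mhz is the core clock in MHz, i.e. cycles per microsecond
 * (assumed constant for the whole measurement). */
static double cycles_to_us(uint32_t cycles, double cpu_mhz)
{
    return (double)cycles / cpu_mhz;
}
```

With the numbers from this question, cycles_to_us(583833498u, 666.666687) gives roughly 875750.22 microseconds.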

Using clock_gettime() (with CLOCK_REALTIME, CLOCK_MONOTONIC, or CLOCK_MONOTONIC_RAW), the measured time is 731627.126 microseconds, which is roughly 150,000 microseconds (about 16%) less.

Can anybody explain why this is happening? Why is there a difference? The processor does not scale its clock, so how can clock_gettime report a shorter execution time? Sample code is below:


#define RUNS 50000000
#define BENCHMARK(val) \
__asm__  __volatile__("mov r4, %1\n\t" \
                 "mov r5, #0\n\t" \
                 "1:\n\t"\
                 "add r5,r5,r4\n\t"\
                 "mov r4 ,r4  \n\t" \
                 "mov r4 ,r4  \n\t" \
                 "mov r4 ,r4  \n\t" \
                 "mov r4 ,r4  \n\t" \
                 "mov r4 ,r4  \n\t" \
                 "mov r4 ,r4  \n\t" \
                 "mov r4 ,r4  \n\t" \
                 "mov r4 ,r4  \n\t" \
                 "mov r4 ,r4  \n\t" \
                 "mov r4 ,r4  \n\t" \
                 "sub r4,r4,#1\n\t" \
                 "cmp r4, #0\n\t" \
                 "bne 1b\n\t" \
                 "mov %0 ,r5  \n\t" \
                 :"=r" (val) \
                 : "r" (RUNS) \
                 : "r4","r5","cc" /* cmp modifies the condition flags */ \
        );
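/* Wall-clock timestamp before the benchmark. */
clock_gettime(CLOCK_MONOTONIC_RAW,&start);
/* Read PMCCNTR, the PMU cycle counter (CP15 c9, c13, 0). */
__asm__ __volatile__ ("MRC p15, 0, %0, c9, c13, 0\t\n": "=r"(start_cycles));
for(index=0;index<5;index++)
{
    BENCHMARK(i);
}
/* Read the cycle counter again, then take the closing timestamp. */
__asm__ __volatile__ ("MRC p15, 0, %0, c9, c13, 0\t\n": "=r"(end_cycles));
clock_gettime(CLOCK_MONOTONIC_RAW,&stop);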
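
For the comparison with the cycle counter, the two timespec values filled in by clock_gettime have to be reduced to a single elapsed time. A minimal sketch of that step follows. The helper name timespec_diff_us is hypothetical and does not appear in the original program.

```c
#include <time.h>

/* Hypothetical helper: microseconds elapsed between two timespec samples.
 * The nanosecond difference may be negative; adding it to the scaled
 * second difference handles the borrow automatically. */
static double timespec_diff_us(const struct timespec *start,
                               const struct timespec *stop)
{
    double sec  = (double)(stop->tv_sec - start->tv_sec);
    double nsec = (double)(stop->tv_nsec - start->tv_nsec);
    return sec * 1e6 + nsec / 1e3;
}
```

In the code above it would be called as timespec_diff_us(&start, &stop) after the second clock_gettime call.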

Answer 1:

I found the solution. I upgraded the platform from Linux kernel 3.3.0 to 3.5, and the value reported by clock_gettime is now similar to that of the performance counters. Apparently kernel 3.3.0 assumes that the clock counter runs faster than it really does (around 400 MHz instead of half the CPU frequency), so it under-reports elapsed time. This is probably a porting error in the old version.
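
As a rough sanity check of this explanation, here is a sketch that assumes the kernel converts timer ticks to time by dividing by the frequency it believes the timer has. Under that assumption the reported time is the true time scaled by actual/assumed frequency. The function name reported_time_us is hypothetical, and 400 MHz is only the approximate figure mentioned above.

```c
/* Hypothetical model: time the kernel would report if it converts timer
 * ticks using assumed_hz while the timer really ticks at actual_hz. */
static double reported_time_us(double true_us, double actual_hz,
                               double assumed_hz)
{
    return true_us * actual_hz / assumed_hz;
}
```

With the cycle-counter time of 875750.221 microseconds, an actual timer frequency of half of 666.666687 MHz and an assumed 400 MHz, this gives about 729,800 microseconds, close to the 731627.126 microseconds that clock_gettime reported on kernel 3.3.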



Answer 2:

The POSIX clocks operate within a certain resolution, which you can query with clock_getres. Check whether that 150,000 microsecond difference is inside or outside the error margin.
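
A minimal sketch of such a query is shown below. The wrapper name clock_resolution_ns is hypothetical; clock_getres itself is the standard POSIX call.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Hypothetical wrapper: resolution of the given clock in nanoseconds,
 * or -1 if clock_getres fails (for example, for an unsupported clock ID). */
static long long clock_resolution_ns(clockid_t clk)
{
    struct timespec res;

    if (clock_getres(clk, &res) != 0)
        return -1;
    return (long long)res.tv_sec * 1000000000LL + res.tv_nsec;
}
```

Calling clock_resolution_ns(CLOCK_MONOTONIC) and comparing the result with the observed difference shows whether clock resolution alone could explain it.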

In any case, it shouldn't matter: you should repeat your benchmark many times, not 5 but 1000 or more. You can then get the timing of a single benchmark run as

((end + e1) - (start + e0)) / 1000, or

(end - start) / 1000 + (e1 - e0) / 1000.

If e1 and e0 are the error terms, which are bounded by a small constant, your maximum measurement error will be abs(e1 - e0) / 1000, which becomes negligible as the number of loops increases.
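
The averaging described above can be sketched as follows. Both average_run_us and sample_work are hypothetical names: sample_work only stands in for the real benchmark body, and CLOCK_MONOTONIC is an arbitrary choice of clock for the sketch.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Hypothetical stand-in for the code being benchmarked. */
static void sample_work(void)
{
    volatile unsigned acc = 0;
    for (unsigned i = 0; i < 1000; i++)
        acc += i;
}

/* Average time of one call to work() in microseconds. The clock is read
 * only once before and once after all runs, so the clock-read errors
 * (e0, e1) are divided by `runs` along with the total time. */
static double average_run_us(void (*work)(void), long runs)
{
    struct timespec start, stop;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long k = 0; k < runs; k++)
        work();
    clock_gettime(CLOCK_MONOTONIC, &stop);

    double total_us = (double)(stop.tv_sec - start.tv_sec) * 1e6
                    + (double)(stop.tv_nsec - start.tv_nsec) / 1e3;
    return total_us / (double)runs;
}
```

For example, average_run_us(sample_work, 1000) returns the mean time per call over 1000 runs.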