I am using a Zynq chip on a development board (ZC702), which has a dual-core Cortex-A9 MPCore at 667 MHz and ships with a Linux 3.3 kernel. I wanted to compare the execution time of a program, so I first used clock_gettime and then used the counters provided by the ARM co-processor (CP15). The cycle counter increments once per processor cycle (based on this Stack Overflow question and this one).
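For context, the cycle counter (PMCCNTR) is only readable from user space once user-mode access to the PMU has been enabled from a privileged mode. A minimal sketch of the enable sequence, assuming it runs in kernel mode (e.g. from a small module); the CP15 encodings are from the ARMv7-A architecture manual:

/* Sketch: enable the ARMv7 PMU cycle counter. Must run in a privileged
 * mode (assumption: called from a kernel module). */
static inline void enable_ccnt(void)
{
    /* PMUSERENR: allow user-mode access to the performance counters */
    __asm__ __volatile__("MCR p15, 0, %0, c9, c14, 0" :: "r"(1));
    /* PMCR: set E (enable all counters) and C (reset cycle counter) */
    __asm__ __volatile__("MCR p15, 0, %0, c9, c12, 0" :: "r"(5));
    /* PMCNTENSET: enable the cycle counter (bit 31) */
    __asm__ __volatile__("MCR p15, 0, %0, c9, c12, 1" :: "r"(1u << 31));
}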
I compile the program with the -O0 flag (since I don't want any reordering or optimization done).
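For reference, the build command I am assuming looks something like this (benchmark.c is a hypothetical file name; -lrt is needed for clock_gettime with glibc older than 2.17):

gcc -O0 -o benchmark benchmark.c -lrt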
The time I measure with the performance counters is 583833498 (cycles) / 666.666687 MHz = 875750.221 (microseconds).
While using clock_gettime() (with either CLOCK_REALTIME, CLOCK_MONOTONIC, or CLOCK_MONOTONIC_RAW) the measured time is 731627.126 (microseconds), which is about 144000 microseconds less.
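To make the arithmetic explicit, here is a small self-contained sketch of the conversion and the difference (the numbers are the ones measured above; 666.666687 MHz is the CPU clock):

#include <stdio.h>

int main(void)
{
    /* Convert the measured cycle count to microseconds and compare
     * against the clock_gettime() result. MHz = cycles per microsecond. */
    unsigned long long cycles = 583833498ULL;
    double freq_mhz = 666.666687;
    double pmu_us = cycles / freq_mhz;
    double gettime_us = 731627.126;
    printf("PMU: %.3f us, clock_gettime: %.3f us, diff: %.3f us\n",
           pmu_us, gettime_us, pmu_us - gettime_us);
    return 0;
}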
Can anybody explain why this is happening? Why is there a difference? The processor does not clock-scale, so how is it possible for clock_gettime to measure less execution time? I have a sample of the code below:
#define RUNS 50000000

/* Busy loop: RUNS iterations of one add, ten register-to-register moves,
   a decrement, a compare, and a branch. */
#define BENCHMARK(val) \
__asm__ __volatile__("mov r4, %1\n\t"   /* r4 = iteration counter */ \
"mov r5, #0\n\t"                        /* r5 = accumulator */ \
"1:\n\t" \
"add r5, r5, r4\n\t" \
"mov r4, r4\n\t" \
"mov r4, r4\n\t" \
"mov r4, r4\n\t" \
"mov r4, r4\n\t" \
"mov r4, r4\n\t" \
"mov r4, r4\n\t" \
"mov r4, r4\n\t" \
"mov r4, r4\n\t" \
"mov r4, r4\n\t" \
"mov r4, r4\n\t" \
"sub r4, r4, #1\n\t" \
"cmp r4, #0\n\t" \
"bne 1b\n\t" \
"mov %0, r5\n\t" \
: "=r" (val) \
: "r" (RUNS) \
: "r4", "r5", "cc" /* cmp modifies the condition flags */ \
);
clock_gettime(CLOCK_MONOTONIC_RAW, &start);
/* Read PMCCNTR, the CP15 cycle counter */
__asm__ __volatile__("MRC p15, 0, %0, c9, c13, 0\n\t" : "=r"(start_cycles));
for (index = 0; index < 5; index++)
{
    BENCHMARK(i);
}
__asm__ __volatile__("MRC p15, 0, %0, c9, c13, 0\n\t" : "=r"(end_cycles));
clock_gettime(CLOCK_MONOTONIC_RAW, &stop);
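The sample stops at reading the counters; the elapsed-time computation looks roughly like the sketch below (variable names match the snippet above; timespec_diff_us is a helper I am introducing here for illustration, not part of any library):

/* Hypothetical helper: difference between two timespecs in microseconds. */
static double timespec_diff_us(const struct timespec *a,
                               const struct timespec *b)
{
    return (b->tv_sec - a->tv_sec) * 1e6 + (b->tv_nsec - a->tv_nsec) / 1e3;
}

double gettime_us = timespec_diff_us(&start, &stop);
double pmu_us = (end_cycles - start_cycles) / 666.666687; /* cycles / MHz */
printf("clock_gettime: %f us, PMU: %f us\n", gettime_us, pmu_us);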