Benchmarking code - am I doing it right?

I want to benchmark some C/C++ code. I want to measure CPU time, wall time and cycles/byte. I wrote some measurement functions but have a problem with cycles/byte.

To get CPU time I wrote a function that calls getrusage() with RUSAGE_SELF; for wall time I use clock_gettime() with CLOCK_MONOTONIC; to get cycles/byte I use rdtsc.
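For reference, the three measurement helpers described above could look roughly like the sketch below. The names wall_seconds, cpu_seconds and cycle_count are hypothetical; they are not the walltime/cputime/cyclecount functions from the question, whose implementations are not shown. The __rdtsc() intrinsic assumes an x86 or x86-64 target with GCC or Clang.

```cpp
#include <cstdint>
#include <sys/resource.h>   // getrusage, RUSAGE_SELF
#include <time.h>           // clock_gettime, CLOCK_MONOTONIC
#include <x86intrin.h>      // __rdtsc (assumption: x86/x86-64, GCC or Clang)

// Wall-clock seconds from the monotonic clock; returns -1.0 on failure.
double wall_seconds()
{
    timespec ts;
    if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0)
        return -1.0;
    return static_cast<double>(ts.tv_sec) + static_cast<double>(ts.tv_nsec) * 1e-9;
}

// User + system CPU seconds consumed by this process; returns -1.0 on failure.
double cpu_seconds()
{
    rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1.0;
    const double user = static_cast<double>(ru.ru_utime.tv_sec)
                      + static_cast<double>(ru.ru_utime.tv_usec) * 1e-6;
    const double sys  = static_cast<double>(ru.ru_stime.tv_sec)
                      + static_cast<double>(ru.ru_stime.tv_usec) * 1e-6;
    return user + sys;
}

// Raw 64-bit time-stamp counter value.
std::uint64_t cycle_count()
{
    return __rdtsc();
}
```

Elapsed values are obtained by calling a helper before and after the measured region and subtracting.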

I process an input buffer of some size, for example 1024: char buffer[1024]. This is how I benchmark:

Do a warm-up phase, simply call fun2measure(args) 1000 times:

for(int i=0; i<1000; i++) fun2measure(args);

Then, do a real-timing benchmark, for wall time:

unsigned long i;
double timeTaken;
double timeTotal = 3.0; // run for 3 seconds

for (timeTaken=(double)0, i=0; timeTaken <= timeTotal; timeTaken = walltime(1), i++) fun2measure(args);

And for CPU time (almost the same, but the loop is bounded by cputime):

for (timeTaken=(double)0, i=0; timeTaken <= timeTotal; timeTaken = cputime(1), i++) fun2measure(args);

But when I want to get a CPU cycle count for the function, I use this piece of code:

// variant 1: loop bounded by wall time
unsigned long s = cyclecount();
for (timeTaken=(double)0, i=0; timeTaken <= timeTotal; timeTaken = walltime(1), i++)
{
    fun2measure(args);
}
unsigned long e = cyclecount();

// variant 2: loop bounded by CPU time
unsigned long s = cyclecount();
for (timeTaken=(double)0, i=0; timeTaken <= timeTotal; timeTaken = cputime(1), i++)
{
    fun2measure(args);
}
unsigned long e = cyclecount();

and then compute cycles/byte: (e - s) / (i * inputsSize). Here inputsSize is 1024 because it's the length of the buffer. But when I raise timeTotal to 10 s I get strange results:

for 10s:

Did fun2measure 1148531 times in 10.00 seconds for 1024 bytes, 0 cycles/byte [CPU]
Did fun2measure 1000221 times in 10.00 seconds for 1024 bytes, 3.000000 cycles/byte [WALL]

for 5s:

Did fun2measure 578476 times in 5.00 seconds for 1024 bytes, 0 cycles/byte [CPU]
Did fun2measure 499542 times in 5.00 seconds for 1024 bytes, 7.000000 cycles/byte [WALL]

for 4s:

Did fun2measure 456828 times in 4.00 seconds for 1024 bytes, 4 cycles/byte [CPU]
Did fun2measure 396612 times in 4.00 seconds for 1024 bytes, 3.000000 cycles/byte [WALL]

My questions:

1. Are those results ok?
2. Why do I always get 0 cycles/byte for CPU time when I increase the duration?
3. How can I compute statistics such as mean and standard deviation for this kind of benchmark?
4. Is my benchmarking method 100% ok?

CHEERS!

1st EDIT:

After changing i to double:

Did fun2measure 1138164.00 times in 10.00 seconds for 1024 bytes, 0.410739 cycles/byte [CPU]
Did fun2measure 999849.00 times in 10.00 seconds for 1024 bytes, 3.382036 cycles/byte [WALL]

My results seem to be ok, so question #2 isn't a question anymore :)
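The change reported in the 1st EDIT is consistent with integer truncation: when every operand of (e - s) / (i * inputsSize) is an unsigned integer, the division discards the fractional part, so any value below 1 prints as 0. A minimal sketch of the difference follows; the function names are hypothetical and the operand types are assumed to be 64-bit unsigned integers.

```cpp
#include <cassert>
#include <cstdint>

// Cycles per byte as a double: converting before dividing keeps the fraction.
double cycles_per_byte(std::uint64_t start, std::uint64_t end,
                       std::uint64_t iterations, std::uint64_t bytes_per_call)
{
    assert(iterations > 0 && bytes_per_call > 0);
    return static_cast<double>(end - start)
         / (static_cast<double>(iterations) * static_cast<double>(bytes_per_call));
}

// The same quotient in pure integer arithmetic, shown only to illustrate
// the truncation: any result below 1 becomes 0.
std::uint64_t cycles_per_byte_truncated(std::uint64_t start, std::uint64_t end,
                                        std::uint64_t iterations,
                                        std::uint64_t bytes_per_call)
{
    assert(iterations > 0 && bytes_per_call > 0);
    return (end - start) / (iterations * bytes_per_call);
}
```

For example, 512 cycles spent on one call over 1024 bytes gives 0.5 with the first function and 0 with the second.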

Your cyclecount benchmark is flawed because it includes the cost of the walltime/cputime calls made on every loop iteration. In general, though, I strongly urge you to use a proper profiler instead of trying to reinvent the wheel. Hardware performance counters in particular will give you numbers that you can rely on. Also note that cycle counts are very unreliable: the CPU is usually not running at a fixed frequency, and the kernel may do a task switch and halt your app for some time.

I personally write benchmarks so that they run a given function N times, with N large enough to give you enough samples. Externally, I then apply a profiler such as Linux perf to get some hard numbers to reason about. By repeating the benchmark a number of times you can then calculate stddev/avg values, which you can do in a script that runs the benchmark a few times and evaluates the output of the profiler.
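If you still want numbers from inside the program, the fixed-N idea can be sketched as below. This is an in-process variant, not the external perf workflow described above. The names Stats, summarize and measure_batches are hypothetical, and __rdtsc() assumes x86 with GCC or Clang. The counter is read only outside the inner loop, so no timer calls land inside the measured region, and each batch yields one sample for the mean and sample standard deviation.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>
#include <x86intrin.h>   // __rdtsc (assumption: x86/x86-64, GCC or Clang)

struct Stats {
    double mean;
    double stddev;
};

// Mean and sample standard deviation (n - 1 denominator) of the samples.
Stats summarize(const std::vector<double>& samples)
{
    const std::size_t n = samples.size();
    if (n == 0)
        return {0.0, 0.0};
    double sum = 0.0;
    for (double s : samples)
        sum += s;
    const double mean = sum / static_cast<double>(n);
    if (n < 2)
        return {mean, 0.0};
    double squared = 0.0;
    for (double s : samples)
        squared += (s - mean) * (s - mean);
    return {mean, std::sqrt(squared / static_cast<double>(n - 1))};
}

// Calls fn a fixed number of times per batch and returns one cycles/byte
// sample per batch. The counter is read only before and after the inner loop.
template <typename Fn>
std::vector<double> measure_batches(Fn fn, std::size_t batches,
                                    std::size_t calls_per_batch,
                                    std::size_t bytes_per_call)
{
    assert(calls_per_batch > 0 && bytes_per_call > 0);
    std::vector<double> samples;
    samples.reserve(batches);
    for (std::size_t b = 0; b < batches; ++b) {
        const std::uint64_t start = __rdtsc();
        for (std::size_t i = 0; i < calls_per_batch; ++i)
            fn();
        const std::uint64_t end = __rdtsc();
        samples.push_back(static_cast<double>(end - start)
                          / (static_cast<double>(calls_per_batch)
                             * static_cast<double>(bytes_per_call)));
    }
    return samples;
}
```

A call such as summarize(measure_batches([&] { fun2measure(args); }, 30, 10000, 1024)) would then give the mean and spread over 30 batches; the batch count and batch size here are arbitrary illustration values. The frequency-scaling and task-switch caveats still apply to these samples.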