How to count the number of CPU clock cycles between the start and end of a benchmark in gem5?
I'm interested in all of the following cases:
- full system userland benchmark. Maybe the `m5` guest tool has a way to do it?
- bare metal benchmark. When gem5 exits it dumps the stats automatically, so the main question is how to skip the cycles for the bootloader and go straight to the benchmark itself. Is there a way besides modifying the benchmark source with instrumentation instructions? How to write those instrumentation instructions in detail?
- syscall emulation benchmark. I think gem5 just outputs `stats.txt` at the end of the run, and then you can just grep `system.cpu.numCycles`, but I have to confirm it. Currently blocked on: How to solve "FATAL: kernel too old" when running gem5 in syscall emulation SE mode?
I want to use this to learn:
- how CPUs work
- how to optimize assembly code or compiler settings to run optimally on a given CPU
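m5 tool
A good approximation is to run `m5 resetstats`, then the benchmark, then `m5 dumpstats`, ideally from a shell script that is the `/init` program.
Then, on the host, grep `stats.txt` for `system.cpu.numCycles`, which gives the number of CPU cycles simulated between the two calls.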
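Note that if you replay from an `m5 checkpoint` with a different CPU, then you need to grep for a different identifier, since the cycle count is reported under the name of the CPU that was switched in rather than under `system.cpu`.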
`resetstats` zeroes out the cumulative stats, and `dumpstats` dumps what has been collected during the benchmark.
This is not perfect since there is some time between the exec syscall for `m5 resetstats` finishing and the benchmark starting, but if the benchmark is long enough, this shouldn't matter.
http://arm.ecs.soton.ac.uk/wp-content/uploads/2016/10/gem5_tutorial.pdf also proposes a few more heuristics.
`m5 exit` also works since gem5 dumps stats when it finishes.
Instrumentation instructions
Sometimes it seems inevitable that you have to modify the input source code a bit with those instructions in order to:
You can of course deduce those instructions from the gem5 `m5` tool source code, but here are some very easy to reuse one-line copy-pastes for arm and aarch64, e.g. for aarch64:
The `m5` tool uses the same mechanism under the hood, but by adding the instructions directly into the source, we avoid the syscall, and the measurement is therefore more precise and representative (at the cost of more manual work).
To ensure that the assembly is not reordered around your region of interest (ROI) by the compiler, however, you might want to use the techniques mentioned at: Enforcing statement order in C++
Address monitoring
Another technique that can be used is to monitor addresses of interest instead of adding magic instructions to the source.
E.g., if you know that a benchmark starts with `PC == 0x400`, it should be possible to do something when that address is hit.
To find the addresses of interest, you would have for example to use `readelf` or `gdb` or tracing, and then, if running full system on top of Linux, ensure that ASLR is turned off.
This technique would be the least intrusive one, but the setup is harder, and to be honest I haven't done it yet. One day, one day.
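As a starting point for the "find the address" step only, here is a small Python sketch that looks up a symbol's address from the output of `readelf --syms --wide`. It is an untested-on-gem5 illustration: the helper names `parse_symbol_address` and `symbol_address` are invented here, and it assumes the usual eight-column symbol table layout (`Num: Value Size Type Bind Vis Ndx Name`). For a position-independent executable the value is only an offset, so this is mostly useful for statically linked, non-PIE benchmarks, or with ASLR turned off.

```python
import subprocess


def parse_symbol_address(readelf_output, symbol):
    """Return the address of `symbol` from `readelf --syms` text, or None."""
    for line in readelf_output.splitlines():
        fields = line.split()
        # Symbol rows have 8 columns: Num: Value Size Type Bind Vis Ndx Name
        if len(fields) != 8:
            continue
        index, value, name = fields[0], fields[1], fields[7]
        # Skip the header row ("Num:") and anything that is not a numbered row.
        if not (index.endswith(":") and index[:-1].isdigit()):
            continue
        if name == symbol:
            return int(value, 16)  # the Value column is hexadecimal
    return None


def symbol_address(elf_path, symbol):
    """Run readelf on an ELF file and return the address of `symbol`, or None."""
    output = subprocess.run(
        ["readelf", "--syms", "--wide", elf_path],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return parse_symbol_address(output, symbol)
```

For example, `hex(symbol_address("./benchmark", "main"))` would give the address to watch for, where `./benchmark` is a placeholder path. What to do once that PC is hit is the part that still has to be wired up on the gem5 side.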