I have been trying to log all memory accesses of a program, which, from what I have read, seems to be impossible. So I have been trying to see how far I can go toward logging at least a major portion of the memory accesses, if not all. I was looking to program the PEBS counters in such a way that I could see changes in the number of memory access samples collected. I wanted to know if I can do this by modifying the counter-reset value of the PEBS counters. (Usually this goes to zero, but I want to set it to a higher value.)
So I was looking to program these PEBS counters on my own. Has anybody had experience manipulating the PEBS counters? Specifically, I am looking for good sources that show how to program them. I have gone through the Intel documentation and understood the steps, but I wanted to study some sample programs. I have gone through the GitHub repo below:
https://github.com/pyrovski/powertools
But I am not quite sure how and where to start. Are there any other good sources that I should look at? Any suggestions for good resources to understand and start programming will be very helpful.
Please don't mix tracing and timing measurements in a single run.
It is just impossible to have both the fastest run of Spec and all memory accesses traced. Do one run for timing and another (longer, slower) run for memory access tracing.
In https://github.com/pyrovski/powertools the frequency of collected events is controlled by the `reset_val` argument of `pebs_init`: https://github.com/pyrovski/powertools/blob/0f66c5f3939a9b7b88ec73f140f1a0892cfba235/msr_pebs.c#L72
This project is a library for accessing PEBS, and no examples of its usage are included in the project (as far as I found, there is only one disabled test in other projects by tpatki).
Check the Intel SDM Vol 3B (this is the only good resource for PEBS programming) for the meaning of the fields and for PEBS configuration and output: https://xem.github.io/minix86/manual/intel-x86-and-64-manual-vol3/o_fe12b1e2a880e0ce-734.html
(So the reset value is probably negative: -1000 to get every 1000th event, -10 to get every 10th event. The counter increments, and a PEBS record is written at counter overflow.)
See also https://xem.github.io/minix86/manual/intel-x86-and-64-manual-vol3/o_fe12b1e2a880e0ce-656.html, section 18.4.4 Processor Event Based Sampling (PEBS), "Table 18-10": only L1/L2/DTLB misses have PEBS events in Intel Core. (Find the PEBS section for your CPU and search for memory events. PEBS-capable events are really rare.)
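To make the negative reset value discussed above concrete, here is a minimal sketch of the arithmetic. It assumes a 48-bit counter, which is an assumption and not something taken from powertools; the actual width is reported by CPUID leaf 0AH and should be checked for your CPU. The function names are hypothetical and only illustrate the two's-complement encoding of "-period".

```python
def pebs_reset_value(period, counter_bits=48):
    """Encode -period as an unsigned value for a counter of counter_bits width.

    counter_bits=48 is an assumption; check CPUID leaf 0AH for the real width.
    """
    if period <= 0:
        raise ValueError("period must be positive")
    if period >= (1 << counter_bits):
        raise ValueError("period does not fit in the counter")
    # Two's complement of -period within counter_bits bits.
    return (1 << counter_bits) - period


def events_until_overflow(reset_value, counter_bits=48):
    """Number of increments before a counter loaded with reset_value wraps to 0."""
    return (1 << counter_bits) - reset_value


for period in (1000, 50, 10):
    value = pebs_reset_value(period)
    print(f"period {period:>4}: reset value = {value:#x}, "
          f"overflows after {events_until_overflow(value)} events")
```

A smaller period means a reset value closer to the top of the counter range, so the counter overflows (and a sample is taken) more often.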
So, to have more events recorded you probably want to set the `reset` part of this function to a smaller absolute value, like -50 or -10. With PEBS this may work. Also try `perf record -e cycles:upp -c 10`: don't ask to profile the kernel at such a high frequency, only user space with `:u`; ask for precise sampling with `:pp`; and ask for the -10 counter with `-c 10`. perf has all the PEBS mechanics implemented, both for the MSRs and for buffer parsing.
Another good resource for the PMU (hardware performance monitoring unit) also comes from Intel: the PMU Programming Guides. They give a short and compact description of both the usual PMU and PEBS. There is a public "Nehalem Core PMU" guide, most of which is still useful for newer CPUs: https://software.intel.com/sites/default/files/m/5/2/c/f/1/30320-Nehalem-PMU-Programming-Guide-Core.pdf (There are also uncore PMU guides: E5-2600 Uncore PMU Guide, 2012 https://www.intel.com/content/dam/www/public/us/en/documents/design-guides/xeon-e5-2600-uncore-guide.pdf)
External PDF about PEBS: "PMCs: Setting Up for PEBS", from "Black Hat USA 2015 - These are Not Your Grand Daddy's CPU Performance Counters": https://www.blackhat.com/docs/us-15/materials/us-15-Herath-These-Are-Not-Your-Grand-Daddys-CPU-Performance-Counters-CPU-Hardware-Performance-Counters-For-Security.pdf#page=23
You may start with a short and simple program (not the ref inputs of recent SpecCPU) and use the `perf` Linux tool (perf_events) to find an acceptable ratio of memory requests recorded to all memory requests. PEBS is used with `perf` by adding the `:p` or `:pp` suffix to the event specifier: `perf record -e event:pp`. Also try pmu-tools ocperf.py for easier Intel event name encoding.
Try to find the real (maximum) overhead with different recording ratios (1% / 10% / 50%) on memory tests like STREAM, GUPS and memlat. These are the worst case for memory recording overhead, at the left end of the Arithmetic Intensity scale of the Roofline model: STREAM is BLAS1, GUPS and memlat are almost SpMV; real tasks are usually not so far left on the scale.
Do you want to trace every load/store instruction, or do you only want to record requests that missed all (or some) caches and were sent to the main RAM of the PC (to L3)?
Why do you want no overhead and all memory accesses recorded? It is just impossible, as every traced memory access needs several bytes of trace data to be written to memory. So having memory tracing enabled (for more than 10% of memory accesses) will clearly limit the available memory bandwidth, and the program will run slower. Even 1% tracing can be noticed, but its effect (overhead) is smaller.
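A rough back-of-the-envelope sketch of that bandwidth cost follows. All numbers in it are assumptions chosen for illustration: the access rate of 10^9 per second is hypothetical, and the 176-byte record size is an assumed value (the PEBS record size depends on the CPU generation, so check the SDM for yours). The function name is hypothetical too.

```python
def trace_bytes_per_second(accesses_per_second, sample_period, record_bytes):
    """Trace data produced per second when every sample_period-th access is recorded."""
    if sample_period <= 0:
        raise ValueError("sample_period must be positive")
    records_per_second = accesses_per_second // sample_period
    return records_per_second * record_bytes


ACCESSES_PER_SECOND = 1_000_000_000  # hypothetical workload, not a measurement
RECORD_BYTES = 176                   # assumed record size; depends on the CPU

# Periods 100, 10 and 2 correspond to tracing 1%, 10% and 50% of accesses.
for period in (100, 10, 2):
    rate = trace_bytes_per_second(ACCESSES_PER_SECOND, period, RECORD_BYTES)
    print(f"1 in {period:>3} accesses traced: {rate / 1e9:.2f} GB/s of trace data")
```

Comparing these rates with the memory bandwidth of your machine gives a first idea of which recording ratios are realistic before measuring the actual overhead.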
Your CPU, the E5-2620 v4, is Broadwell-EP (14nm), so it may also have some earlier variant of Intel PT: https://software.intel.com/en-us/blogs/2013/09/18/processor-tracing https://github.com/torvalds/linux/blob/master/tools/perf/Documentation/intel-pt.txt https://github.com/01org/processor-trace and especially Andi Kleen's blog post on PT: http://halobates.de/blog/p/410 "Cheat sheet for Intel Processor Trace with Linux perf and gdb"
PS: Scholars who study SpecCPU for memory access worked with memory access dumps/traces, and dumps were generated slowly: