Does anybody know what is the meaning of stalled-cycles-frontend and stalled-cycles-backend in perf stat result ? I searched on the internet but did not find the answer. Thanks
$ sudo perf stat ls
Performance counter stats for 'ls':
0.602144 task-clock # 0.762 CPUs utilized
0 context-switches # 0.000 K/sec
0 CPU-migrations # 0.000 K/sec
236 page-faults # 0.392 M/sec
768956 cycles # 1.277 GHz
962999 stalled-cycles-frontend # 125.23% frontend cycles idle
634360 stalled-cycles-backend # 82.50% backend cycles idle
890060 instructions # 1.16 insns per cycle
# 1.08 stalled cycles per insn
179378 branches # 297.899 M/sec
9362 branch-misses # 5.22% of all branches [48.33%]
0.000790562 seconds time elapsed
To convert generic events exported by perf into your CPU documentation raw events you can run:
It will show you something like
According to the Intel documentation SDM volume 3B (I have a core i5-2520):
UOPS_ISSUED.ANY:
For the stalled-cycles-backend event translating to event=0xb1,umask=0x01 on my system the same documentation says:
UOPS_DISPATCHED.THREAD:
Usually, stalled cycles are cycles where the processor is waiting for something (memory to be feed after executing a load operation for example) and doesn't have any other stuff to do. Moreover, the frontend part of the CPU is the piece of hardware responsible to fetch and decode instructions (convert them to UOPs) where as the backend part is responsible to effectively execute the UOPs.
A CPU cycle is “stalled” when the pipeline doesn't advance during it.
Processor pipeline is composed of many stages: the front-end is a group of these stages which is responsible for the fetch and decode phases, while the back-end executes the instructions. There is a buffer between front-end and back-end, so when the former is stalled the latter can still have some work to do.
Taken from http://paolobernardi.wordpress.com/2012/08/07/playing-around-with-perf/
According to author of these events, they defined loosely and are approximated by available CPU performance counters. As I know, perf doesn't support formulas to calculate some synthetic event based on several hardware events, so it can't use front-end/back-end stall bound method from Intel's Optimization manual (Implemented in VTune) http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf "B.3.2 Hierarchical Top-Down Performance Characterization Methodology"
Right formulas can be used with some external scripting, like it was done in Andi Kleen's pmu-tools (
toplev.py
): https://github.com/andikleen/pmu-tools (source), http://halobates.de/blog/p/262 (description):Commit which introduced stalled-cycles-frontend and stalled-cycles-backend events instead of original universal
stalled-cycles
:http://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/commit/?id=8f62242246351b5a4bc0c1f00c0c7003edea128a
The theory:
Let's start from this: nowaday's CPU's are superscalar, which means that they can execute more than one instruction per cycle (IPC). Latest Intel architectures can go up to 4 IPC (4 x86 instruction decoders). Let's not bring macro / micro fusion into discussion to complicate things more :).
Typically, workloads do not reach IPC=4 due to various resource contentions. This means that the CPU is wasting cycles (number of instructions is given by the software and the CPU has to execute them in as few cycles as possible).
We can divide the total cycles being spent by the CPU in 3 categories:
To get an IPC of 4, the number of cycles retiring has to be close to the total number of cycles. Keep in mind that in this stage, all the micro-operations (uOps) retire from the pipeline and commit their results into registers / caches. At this stage you can have even more than 4 uOps retiring, because this number is given by the number of execution ports. If you have only 25% of the cycles retiring 4 uOps then you will have an overall IPC of 1.
The cycles stalled in the back-end are a waste because the CPU has to wait for resources (usually memory) or to finish long latency instructions (e.g. transcedentals - sqrt, reciprocals, divisions, etc.).
The cycles stalled in the front-end are a waste because that means that the Front-End does not feed the Back End with micro-operations. This can mean that you have misses in the Instruction cache, or complex instructions that are not already decoded in the micro-op cache. Just-in-time compiled code usually expresses this behavior.
Another stall reason is branch prediction miss. That is called bad speculation. In that case uOps are issued but they are discarded because the BP predicted wrong.
The implementation in profilers:
How do you interpret the BE and FE stalled cycles?
Different profilers have different approaches on these metrics. In vTune, categories 1 to 3 add up to give 100% of the cycles. That seams reasonable because either you have your CPU stalled (no uOps are retiring) either it performs usefull work (uOps) retiring. See more here: https://software.intel.com/sites/products/documentation/doclib/stdxe/2013SP1/amplifierxe/snb/index.htm
In perf this usually does not happen. That's a problem because when you see 125% cycles stalled in the front end, you don't know how to really interpret this. You could link the >1 metric with the fact that there are 4 decoders but if you continue the reasoning, then the IPC won't match.
Even better, you don't know how big the problem is. 125% out of what? What do the #cycles mean then?
I personally look a bit suspicious on perf's BE and FE stalled cycles and hope this will get fixed.
Probably we will get the final answer by debugging the code from here: http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/tools/perf/builtin-stat.c