I know that there are tools like top and ps for measuring CPU usage, but the way they measure the CPU usage is by measuring how much time the idle task was not running. So, for example, even if a CPU has a stall due to a cache miss, these tools would still consider the CPU to be occupied. However, what I want is for the profiling tool to consider the CPU as idle during a stall. Is there any tool which does that?
No, they don't measure idle time; they just read what the kernel thinks about its CPU usage via /proc/stat (try the vmstat 1 tool too). Did you actually check that system-wide user + system time is accounted only via the idle task? I think the kernel simply exports some scheduler statistics, which record the user/system state on rescheduling, both on the system timer tick and on blocking system calls (probably via one of the callers of cpuacct_charge, such as update_curr - "Update the current task's runtime statistics"). A /proc/stat sample is shown below; the fields are decoded at http://www.linuxhowtos.org/System/procstat.htm.
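As an illustration (the counter values here are made up for the example, not taken from a real machine), the first lines look like this; each field is a cumulative jiffy count per state, so tools like top just diff two snapshots:

    $ head -2 /proc/stat
    cpu  84282 747 20805 1615949 44349 0 308 0 0 0
    cpu0 21043 190 5231 403987 11087 0 154 0 0 0
    # fields: user nice system idle iowait irq softirq steal guest guest_nice

Note that a cycle spent stalled on a cache miss is still charged to user or system here, never to idle - which is exactly the behaviour the question complains about.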
When we hear "jiffy", it means that the scheduler was used to get the numbers, not any measurement of the idle task (top doesn't even see that task, or tasks with pid 0).

And basically (when there is no SMT, like HT in Intel CPUs), the CPU is occupied when your task has a pipeline stall due to a memory access (or takes the wrong path with out-of-order execution). The OS can't run another task, because a task switch is more expensive than waiting out this one stall.
The case of SMT is different, because there is hardware that either switches two logical tasks on a single core, or even (in fine-grained SMT) mixes their instructions (micro-operations) into a single stream executed on shared hardware. There are usually SMT statistics counters to check the actual mixing.
The performance monitoring unit (PMU) may have useful events for this. For example, perf stat reports some of them (on Sandy Bridge); the invocation is sketched below, after the /usr/bin/time comparison. It says that 0.5 jiffies (task-clock) were used by sleep 10. That is too little to be accounted for in classic rusage, and /usr/bin/time got 0 jiffies as the task's CPU usage (user + system):

    $ /usr/bin/time sleep 10
    0.00user 0.00system 0:10.00elapsed 0%CPU (0avgtext+0avgdata 2608maxresident)k
    0inputs+0outputs (0major+210minor)pagefaults 0swaps
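The perf invocation itself is just the plain stat subcommand wrapped around the program; on that generation of hardware and perf, the default output typically already includes the counters discussed next, so no explicit event list is needed:

    $ perf stat sleep 10
    # look at the task-clock, cycles, instructions,
    # stalled-cycles-frontend and stalled-cycles-backend lines of the output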
Then perf measures (counts, with the help of the PMU) the real cycles and real instructions executed by the task (and by the kernel on behalf of the task) - the cycles and instructions lines. sleep used 888k cycles, but only 593k useful instructions were finished, for a mean IPC of 0.6-0.7 (30-40% stalls). Around 300k cycles were lost, and on Sandy Bridge perf reports where they were lost - the stalled-cycles-* events: frontend (the decoder - the CPU doesn't know what to execute, because of a branch miss or because the code was not prefetched into L1I) and backend (it can't execute, because an instruction needs data from memory that is not available at the right time - a memory stall).

Why do we see more stalls inside the CPU than the 300k cycles in which no instruction finished? Because modern processors are usually superscalar and out-of-order - they can start executing several instructions every CPU clock tick, and even reorder them.
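To get the same breakdown for your own program, you can request those events explicitly (./my_app is a placeholder here, and the generic stalled-cycles-* aliases are not available on every CPU/kernel combination):

    $ perf stat -e task-clock,cycles,instructions,stalled-cycles-frontend,stalled-cycles-backend ./my_app

The ratio of stalled cycles to total cycles is then a rough answer to the original question: it tells you how much of the "busy" CPU time was actually spent waiting, which /proc/stat-based tools like top never show.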
ocperf
(perf wrapper) from Andi Kleen's pmu-tools and some Intel manuals about their PMU counters. There is alsotoplev.py
script to "identify the micro-architectural bottleneck for a workload" without selecting Intel events by hands.