How can I measure real cpu usage in linux?

I know that there are tools like top and ps for measuring CPU usage, but the way they measure the CPU usage is by measuring how much time the idle task was not running. So, for example, even if a CPU has a stall due to a cache miss, these tools would still consider the CPU to be occupied. However, what I want is for the profiling tool to consider the CPU as idle during a stall. Is there any tool which does that?

tools like top and ps for measuring CPU usage, .. measure the CPU usage is by measuring how much time the idle task was not running.

No, they don't measure idle, they just read what kernel thinks about its CPU usage via /proc/stat (try vmstat 1 tool too). Did you check that system-wide user + system times are accounted only by idle? I think, kernel just exports some stats of scheduler, which records user/system state on rescheduling, both on system timer and on blocking system calls (possibly one of callers of cpuacct_charge, like update_curr - Update the current task's runtime statistics.).

/proc/stat example:

cat /proc/stat
cpu  2255 34 2290 22625563 6290 127 456

and decode by http://www.linuxhowtos.org/System/procstat.htm

The very first "cpu" line aggregates the numbers in all of the other "cpuN" lines. These numbers identify the amount of time the CPU has spent performing different kinds of work. Time units are in USER_HZ or Jiffies (typically hundredths of a second).

The meanings of the columns are as follows, from left to right:

user: normal processes executing in user mode
nice: niced processes executing in user mode
system: processes executing in kernel mode
idle: twiddling thumbs

When we hear jiffie, it means that scheduler was used to get the numbers, not estimating of idle task (top even don't see this task or tasks with pid 0).

So, for example, even if a CPU has a stall due to a cache miss, these tools would still consider the CPU to be occupied.

And basically (when there is no SMT, like HT in Intels), CPU is occupied when your task has pipeline stall due to memory access (or taking wrong path with out-of-order). OS can't run other task, because task switch is more expensive that waiting this one stall.

Case of SMT is different, because there is hardware which either switchs two logical tasks on single hardware, or even (in fine grained SMT) mixing their instructions (microoperations) into the single stream to execute on shared hardware. There are usually SMT statistic counters to check actual mixing.

However, what I want is for the profiling tool to consider the CPU as idle during a stall. Is there any tool which does that?

Performance monitoring unit may have useful events for this. For example, perf stat reports some (on Sandy Bridge)

$ perf stat /bin/sleep 10

 Performance counter stats for '/bin/sleep 10':
      0,563759 task-clock                #    0,000 CPUs utilized          
             1 context-switches          #    0,002 M/sec                  
             0 CPU-migrations            #    0,000 M/sec                  
           175 page-faults               #    0,310 M/sec                  
       888 767 cycles                    #    1,577 GHz                    
       568 603 stalled-cycles-frontend   #   63,98% frontend cycles idle   
       445 567 stalled-cycles-backend    #   50,13% backend  cycles idle   
       593 103 instructions              #    0,67  insns per cycle        
                                         #    0,96  stalled cycles per insn
       115 709 branches                  #  205,246 M/sec                  
         7 699 branch-misses             #    6,65% of all branches        

  10,000933734 seconds time elapsed

So, it says that 0,5 jiffie (task-clock) was used by the sleep 10. It is too low to be accounted in classic rusage, and /usr/bin/time got 0 jiffie as task CPU usage (user + system): $ /usr/bin/time sleep 10 0.00user 0.00system 0:10.00elapsed 0%CPU (0avgtext+0avgdata 2608maxresident)k 0inputs+0outputs (0major+210minor)pagefaults 0swaps

Then perf measures (counts with help of PMU) real cycles and real instructions executed by the task (and by kernel on behalf of the task) - cycles and instructions lines. Sleep has used 888k cycles but only 593k useful instructions were finished, and mean IPC was 0.6-0.7 (30-40% stalls). Around 300k cycles was lost; and on Sandy bridge perf reports where they were lost - stalled-cycles-* events for frontend (decoder - CPU don't know what to execute due to branch miss or due to code not prefetched to L1I) and for backend (can't execute because instruction needs some data from memory which is not available at right time - memory stall).

Why we see more stalls inside CPU when there should be only 300k cycles without any instruction executed? This is because modern processors are often superscalar and out-of-order - they can start to executing several instructions every CPU clock tick, and even reorder them. If you want to see execution port utilization, try ocperf (perf wrapper) from Andi Kleen's pmu-tools and some Intel manuals about their PMU counters. There is also toplev.py script to "identify the micro-architectural bottleneck for a workload" without selecting Intel events by hands.