Instruction Level Profiling: The Meaning of the In

2019-04-06 05:00发布

问题:

When profiling code at the the assembly instruction level, what does the position of the instruction pointer really mean given that modern CPUs don't execute instructions serially or in-order? For example, assume the following x64 assembly code:

mov RAX, [RBX];         // Assume a cache miss here.
mov RSI, [RBX + RCX];   // Another cache miss.             
xor R8, R8;        
add RDX, RAX;           // Dependent on the load into RAX.
add RDI, RSI;           // Dependent on the load into RSI.

Which instruction will the instruction pointer spend most of its time on? I can think of good arguments for all of them:

  • mov RAX, [RBX] is taking probably 100s of cycles because it's a cache miss.
  • mov RSI, [RBX + RCX] also takes 100s of cycles, but probably executes in parallel with the previous instruction. What does it even mean for the instruction pointer to be on one or the other of these?
  • xor R8, R8 probably executes out-of-order and finishes before the memory loads finish, but the instruction pointer might stay here until all previous instructions are also finished.
  • add RDX, RAX generates a pipeline stall because it's the instruction where the value of RAX is actually used after a slow cache-miss load into it.
  • add RDI, RSI also stalls because it's dependent on the load into RSI.

回答1:

CPUs maintains a fiction that there are only the architectural registers (RAX, RBX, etc) and there is a specific instruction pointer (IP). Programmers and compilers target this fiction.

Yet as you noted, modern CPUs don't execute serially or in-order. Until you the programmer / user request the IP, it is like Quantum Physics, the IP is a wave of instructions being executed; all so that the processor can run the program as fast as possible. When you request the current IP (for example, via a debugger breakpoint or profiler interrupt), then the processor must recreate the fiction that you expect so it collapses this wave form (all "in flight" instructions), gathers the register values back into architectural names, and builds a context for executing the debugger routine, etc.

In this context, there is an IP that indicates the instruction where the processor should resume execution. During the out-of-order execution, this instruction was the oldest instruction yet to complete, even though at the time of the interrupt the processor was perhaps fetching instructions well past that point.

For example, perhaps the interrupt indicates mov RSI, [RBX + RCX]; as the IP, but the xor had already executed and completed; however, when the processor would resume execution after the interrupt, it will re-execute the xor.