When profiling code at the assembly-instruction level, what does the position of the instruction pointer really mean, given that modern CPUs don't execute instructions serially or in order? For example, assume the following x64 assembly code:
mov RAX, [RBX]        ; assume a cache miss here
mov RSI, [RBX + RCX]  ; another cache miss
xor R8, R8
add RDX, RAX          ; dependent on the load into RAX
add RDI, RSI          ; dependent on the load into RSI
Which instruction will the instruction pointer spend most of its time on? I can think of good arguments for all of them:
- mov RAX, [RBX] is probably taking hundreds of cycles because it's a cache miss.
- mov RSI, [RBX + RCX] also takes hundreds of cycles, but probably executes in parallel with the previous instruction. What does it even mean for the instruction pointer to be on one or the other of these?
- xor R8, R8 probably executes out of order and finishes before the memory loads do, but the instruction pointer might stay here until all previous instructions have also finished.
- add RDX, RAX generates a pipeline stall because it's the instruction where the value of RAX is actually used, after the slow cache-missing load into it.
- add RDI, RSI also stalls because it's dependent on the load into RSI.
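In case it helps to experiment, here is a rough C sketch of the same pattern: two independent, likely-cache-missing loads followed by adds that depend on their results. This is just a toy harness of my own, not code from a real program; the buffer size, iteration count, and the xorshift PRNG are arbitrary choices, and I'm assuming a Linux machine where it can be built with -O2 and profiled with something like perf record -e cycles:pp ./a.out followed by perf annotate to see which instructions the samples land on.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define N (1u << 26)   /* 2^26 elements (512 MiB), far larger than any cache */

/* Small PRNG so the load addresses are unpredictable and defeat the prefetchers. */
static inline uint64_t xorshift64(uint64_t *state)
{
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    return x;
}

int main(void)
{
    uint64_t *buf = malloc((size_t)N * sizeof *buf);
    if (!buf)
        return 1;
    for (uint64_t i = 0; i < N; i++)
        buf[i] = i;

    uint64_t s1 = 0x9e3779b97f4a7c15ull;
    uint64_t s2 = 0xda942042e4dd58b5ull;
    uint64_t sum1 = 0, sum2 = 0;

    for (uint64_t i = 0; i < 50000000ull; i++) {
        uint64_t a = buf[xorshift64(&s1) & (N - 1)]; /* like mov RAX, [RBX]: likely a miss */
        uint64_t b = buf[xorshift64(&s2) & (N - 1)]; /* like mov RSI, [RBX + RCX]: another miss, independent of the first */
        sum1 += a;                                   /* like add RDX, RAX: first use of the loaded value */
        sum2 += b;                                   /* like add RDI, RSI: first use of the other loaded value */
    }

    /* Print the sums so the compiler can't optimize the loop away. */
    printf("%llu %llu\n", (unsigned long long)sum1, (unsigned long long)sum2);
    free(buf);
    return 0;
}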