I'm on an IvyBridge. I found the performance behavior of jnz
inconsistent in inner loop and outer loop.
The following simple program has an inner loop with fixed size 16:
global _start
_start:
mov rcx, 100000000
.loop_outer:
mov rax, 16
.loop_inner:
dec rax
jnz .loop_inner
dec rcx
jnz .loop_outer
xor edi, edi
mov eax, 60
syscall
perf
tool shows the outer loop runs 32c/iter. It suggests the jnz
requires 2 cycles to complete.
I then search in Agner's instruction table, conditional jump has 1-2 "reciprocal throughput", with a comment "fast if no jump".
At this point I start to believe the above behavior is somehow expected. But why does jnz
in an outer loop only require 1 cycle to complete?
If I remove the .loop_inner
part altogether, the outer loop runs 1c/iter. The behavior looks inconsistent.
What I am missing here?
Edit for more info:
The perf
results for the above program with command:
perf stat -ecycles,branches,branch-misses,lsd.uops,uops_issued.any -r4 ./a.out
is:
3,215,921,579 cycles ( +- 0.11% ) (79.83%)
1,701,361,270 branches ( +- 0.02% ) (80.05%)
19,212 branch-misses # 0.00% of all branches ( +- 17.72% ) (80.09%)
31,052 lsd.uops ( +- 76.58% ) (80.09%)
1,803,009,428 uops_issued.any ( +- 0.08% ) (79.93%)
The perf
result of the reference case:
global _start
_start:
mov rcx, 100000000
.loop_outer:
mov rax, 16
dec rcx
jnz .loop_outer
xor edi, edi
mov eax, 60
syscall
is:
100,978,250 cycles ( +- 0.66% ) (75.75%)
100,606,742 branches ( +- 0.59% ) (75.74%)
1,825 branch-misses # 0.00% of all branches ( +- 13.15% ) (81.22%)
199,698,873 lsd.uops ( +- 0.07% ) (87.87%)
200,300,606 uops_issued.any ( +- 0.12% ) (79.42%)
So the cause is mostly clear: LSD stops working for some reason in the nested case. Reducing the inner loop size will slightly mitigate the slowness, but not completely.
Searching Intel "optimization manual", I found that LSD won't work if the loop contains "more than eight taken branches". This somehow explains the behavior.
(Partial answer / speculation I didn't finish writing before Hadi posted a detailed analysis; some of this is continuing from comments)
Yes, Agner calls it the loopback buffer. His statement is that adding the LSD to the design doesn't speed up any code. But yes, it appears to be wrong for very tight loops, at least for nested loops. Apparently SnB/IvB do need the loop buffer to issue or execute 1c/iter loops. Unless the microarchitectural bottleneck is in fetching uops from the uop cache after branching, in which case his caveat covers this.
There are cases other than uop-cache misses where reading the uop cache can be a bottleneck. e.g. if uops didn't pack very well because of alignment effects, or if they use large immediates and/or displacements that take extra cycles to read from the uop cache. See the Sandybridge section of Agner Fog's uarch guide for more details about these effects. Your assumption that capacity (up to 1.5k uops if they pack perfectly) is the only reason it could be slow is very wrong.
BTW, a microcode update for Skylake disabled the LSD entirely to fix a partial-register merging bug, erratum SKL1501, and that did in fact have little effect except when a tiny loop spans a 32B boundary and needs 2 cache lines.
But Agner lists
JMP rel8/32
and taken JCC throughput as 1-2 cycles on HSW/SKL, vs. just 2 on IvB. So something about taken branches may have sped up since IvB other than the LSD itself.There might be some part(s) of the CPU other than LSD which also has a special case for long-running tiny loops that lets them run 1 taken jump per clock, on Haswell and later. I haven't tested what conditions cause 1 vs. 2 cycle taken branch throughput on HSW/SKL. Also note that Agner measured before the microcode update for erratum SKL150.
footnote 1: See How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent, and note that SKX and Kaby Lake ship with microcode that already includes this. It's finally re-enabled in Coffee Lake, which fixed the buggy hard-wired logic so the LSD can safely be enabled again. https://en.wikichip.org/wiki/intel/microarchitectures/coffee_lake#Key_changes_from_Kaby_Lake
TL;DR: The DSB seems to be only able to deliver one jump uop of the inner loop every other cycle. Also DSB-MITE switches constitute up to 9% of execution time.
Introduction - Part 1: Understanding the LSD performance events
I'll first discuss when the
LSD.UOPS
andLSD.CYCLES_ACTIVE
performance events occur and some peculiarities of the LSD on the IvB and SnB microarchitectures. Once we establish this foundation, we can then answer the question. To do this, we can use small pieces of code that are especially designed to determine accurately when these events occur.According to the documentation:
These definitions are useful, but, as you'll see later, not precise enough to answer your question. It's important to develop a better understand of these events. Some of the information presented here is not documented by Intel and it's just my best interpretation of the empirical results and some of the related patents that I went through. Although I was not able to find the specific patent that describes the LSD implementation in SnB or later microarchitectures.
Each of the following benchmarks start with a comment that contains the name of the benchmark. All numbers are normalized per iteration, unless otherwise mentioned.
Both instructions in the loop body are mac-fused into a single uop. There is only one execution port on IvB and SnB that can execute jump instructions. Therefore, the maximum throughput should be 1c/iter. IvB is 10% faster, though, for some reason.
According to Is performance reduced when executing loops whose uop count is not a multiple of processor width?, the LSD in IvB and SnB cannot issue uops across the loop body boundaries even if there are available issue slots. Since the loop contains a single uop, we expect that the LSD will issue a single uop per cycle and that
LSD.CYCLES_ACTIVE
should about equal to the total number of cycles.On IvB,
LSD.UOPS
is as expected. That is, the LSD will issue one uop per cycle. Note that since the number of cycles is equal to the number of iterations which is equal to the number of uops, we can equivalently say that the LSD issues one uop per iteration. Essentially, most of the uops that were execute were issued from the LSD. However,LSD.CYCLES_ACTIVE
is about half of the number of cycles. How is this possible? In this case, shouldn't only half of the total number of uops be issued from the LSD? I think what is happening here is that the loop is being essentially unrolled twice and two uops are being issued per cycle. Nonetheless, only a single uop can be executed per cycle yetRESOURCE_STALLS.RS
is zero, indicating that RS never gets full. However,RESOURCE_STALLS.ANY
is about half of the cycle count. Putting all of this together now, it seems that the LSD is actually issuing 2 uops every other cycle and that there is some structural limitation that is being reached every other cycle.CYCLE_ACTIVITY.CYCLES_NO_EXECUTE
confirms that there is always at least one read uop in the RS at any given cycle. The following experiments will reveal the conditions for the unrolling to happen.On SnB,
LSD.UOPS
shows that twice the total number of uops were issued from the LSD. AlsoLSD.CYCLES_ACTIVE
indicates the LSD was active most of the time.CYCLE_ACTIVITY.CYCLES_NO_EXECUTE
andUOPS_ISSUED.STALL_CYCLES
are as on IvB. The following experiments are helpful to understand what's happening. It seems that the measuredLSD.CYCLES_ACTIVE
is equal to the realLSD.CYCLES_ACTIVE
+RESOURCE_STALLS.ANY
. Therefore, to get the realLSD.CYCLES_ACTIVE
,RESOURCE_STALLS.ANY
must be subtracted from the measuredLSD.CYCLES_ACTIVE
. The same applies toLSD.CYCLES_4_UOPS
. The realLSD.UOPS
can be calculated as follows:LSD.UOPS
measured =LSD.UOPS
real + ((LSD.UOPS
measured/LSD.CYCLES_ACTIVE
measured)*RESOURCE_STALLS.ANY
)Thus,
LSD.UOPS
real =LSD.UOPS
measured - ((LSD.UOPS
measured/LSD.CYCLES_ACTIVE
measured) *RESOURCE_STALLS.ANY
)=
LSD.UOPS
measured * (1 - (RESOURCE_STALLS.ANY
/LSD.CYCLES_ACTIVE
measured))For all of the benchmarks I've run on SnB (including those not shown here), these adjustments are accurate.
Note that
RESOURCE_STALLS.RS
andRESOURCE_STALLS.ANY
on SnB are just like IvB. So it seems that the LSD works the same way, as far as this particular benchmark is concerned, on IvB and SnB, except that the eventsLSD.UOPS
andLSD.CYCLES_ACTIVE
are counted differently.In B2, there are 2 uops per iteration and both are jumps. The first one is never taken, so there is still only one loop. We expect it to run at 2c/iter, which is indeed the case.
LSD.UOPS
shows that most uops were issued from LSD, butLSD.CYCLES_ACTIVE
shows that the LSD was active only half of the time. This means that the loop was not unrolled. So it seems that unrolling only occurs when there is a single uop in the loop.There are also 2 uops here, but the first one is a single-cycle ALU uop that is not related to the jump uop. B3 helps us answer the following two questions:
LSD.UOPS
andLSD.CYCLES_ACTIVE
still count twice on SnB?B3 shows that the answer to both question is a "No."
UOPS_ISSUED.STALL_CYCLES
suggests that the LSD will only stall one cycle if it issues two jump uops in one cycle. This never happens in B3, so there are no stalls.B4 has an additional twist to it; it contains 2 uops in the fused domain but 3 uops in the fused domain because the load-ALU instruction gets unfused in the RS. In the previous benchmarks, there were no micro-fused uops, only macro-fused uops. The goal here is to see how micro-fused uops are treated by the LSD.
LSD.UOPS
shows that the two uops of the load-ALU instruction have consumed a single issue slot (the fused jump uop consumes only a single slot). Also sinceLSD.CYCLES_ACTIVE
is equal tocycles
, no unrolling has occurred. The loop throughput is as expected.B5 is the last benchmark that we will need. It is similar to B2 in that it contains two branch uops. However, one of the jump uops in B5 is a forward unconditional jump. The results are identical to B2, indicating that it doesn't matter whether a jump uop is conditional or not. This is also case if the first jump uop is conditional and the second is not.
Introduction - Part 2: Branch prediction in the LSD
The LSD is mechanism implemented in the uop queue (IDQ) that can improve performance and reduce power consumption (consequently, heat emission is reduced). It can improve performance because some of the limitations that exist in the frontend may be relaxed in the uop queue. In particular, on SnB and IvB, both the MITE and DSB paths have a maximum throughput of 4uops/c, but in terms of bytes, it's 16B/c and 32B/c, respectively. The uop queue bandwidth is also 4uops/c, but has no limitation on the number of bytes. As long as the LSD issues uops from the uop queue, the frontend (i.e., the fetch and decode units) and even unneeded logic downstream from the IDQ can be powered down. Prior to Nehalem, the LSD was implemented in the IQ unit. Starting with Haswell, the LSD supports loops that contain uops from the MSROM. The LSD in Skylake processors is disabled because, apparently, it's buggy.
Loops usually contain at least one conditional branch. The LSD essentially monitors backward conditional branches and tries to determine a sequence of uops that constitute a loop. If the LSD takes too much time to detect a loop, performance may degrade and power may be wasted. On the other hand, if the LSD prematurely locks down a loop and attempts to replay it, the conditional jump of the loop may actually fall through. This can only be detected after executing the conditional jump, which means that later uops might have already issued and dispatched for execution. All of these uops need to be flushed and the frontend need to be activated to fetch uops from the correct path. So there can be a significant performance penalty if the performance improvement from using the LSD does not exceeds the performance degradation resulting from potentially mispredicting the last execution of the conditional branch where the loop is exited.
We already know that the branch prediction unit (BPU) on SnB and later can correctly predict when a conditional branch of a loop falls through when the total number of iterations does not exceed some small number, after which the BPU assumes that the loop will iteration forever. If the LSD uses the sophisticated capabilities of the BPU to predict when a locked down loop terminates, it should be able to correctly predict the same cases. It's possible also that the LSD uses its own branch predictor that is potentially much simpler. Let's find out.
Let
OC
andIC
denote the number of outer iterations and the number of inner iterations, respectively. These are related as follows:OC
= 100000000/(IC
+3) whereIC
> 0For any given
IC
, the total number of uops retired is the same. In addition, the number of uops in the fused domain is equal to the number of uops in the unfused domain. This is nice because it really simplifies the analysis and allows us to make a fair performance comparison between different values ofIC
.Compared to the code from the question, there is an additional instruction,
mov rbx, 1
, so that the total number of uops in the outer loop is exactly 4 uops. This enables us to make use of theLSD.CYCLES_4_UOPS
performance event in addition toLSD.CYCLES_ACTIVE
andBR_MISP_RETIRED.CONDITIONAL
. Note that since there is only a single branch execution port, each outer loop iteration takes at least 2 cycles (or according to Agner's table, 1-2 cycles). See also: Can the LSD issue uOPs from the next iteration of the detected loop?.The total number of jump uops is:
OC
+IC
*OC
= 100M/(IC
+3) +IC
*100M/(IC
+3)= 100M(
IC
+1)/(IC
+3)Assuming that the maximum jump uop throughput is 1 per cycle, the optimal execution time is 100M(
IC
+1)/(IC
+3) cycles. On IvB, we can instead use a maximum jump uop throughput of 0.9/c if we want to be strict. It would be useful to divide this by the number of inner iterations:OPT
= (100M(IC
+1)/(IC
+3)) / (100MIC
/(IC
+3)) =100M(
IC
+1) * (IC
+3) / (IC
+3) * 100MIC
=(
IC
+1)/IC
= 1 + 1/IC
Hence, 1 <
OPT
<= 1.5 forIC
> 1. The person designing the LSD can use this to compare different designs of the LSD. We'll use this shortly also. Putting it another way, the optimal performance is achieved when the total number of cycles divided by the total number of jumps is 1 (or 0.9 on IvB).Assuming that the prediction for the two jumps are independent and given that
jnz .loop_outer
is easily predictable, the performance depends on the prediction ofjnz .loop_inner
. On a misprediction that changes control to a uop outside of the locked down loop, the LSD terminates the loop and tries to detect another loop. The LSD can be represented as a state machine with three states. In one state, the LSD is looking for a looping behavior. In the second state, the LSD is learning the boundaries and the number of iterations of the loop. In the third state, the LSD is replaying the loop. When the loop exists, the state changes from the third to the first.As we have learned from the previous set of experiments, there will be extra LSD events on SnB when there are backend-related issue stalls. So the numbers need to be understood accordingly. Note that the case where
IC
=1 has not been tested in the previous section. It will be discussed here. Recall also that, on both IvB and SnB, the inner loop may get unrolled. The outer loop will never get unrolled because it contains more than one uop. By the way,LSD.CYCLES_4_UOPS
works as expected (sorry, no surprises there).The following figures show the raw results. I've only shown the results up to
IC
=13 andIC
=9 on IvB and SnB, respectively. I'll discuss in the next section what happens for larger values. Note that when the denominator is zero, the value cannot be computed and so it's not plotted.LSD.UOPS/100M
is the ratio of the number of uops issued from the LSD to the total number of uops.LSD.UOPS/OC
is the average number of uops issued from the LSD per outer iteration.LSD.UOPS/(OC*IC)
is the average number of uops issued from the LSD per inner iteration.BR_MISP_RETIRED.CONDITIONAL/OC
is the average number of retired conditional branches that were mispredicted per outer iteration, which is clearly zero on both IvB and SnB for allIC
.For
IC
=1 on IvB, all uops were issued from the LSD. The inner conditional branch is always not taken. TheLSD.CYCLES_4_UOPS/LSD.CYCLES_ACTIVE
metric shown in the second figure shows that in all of the cycles in which the LSD is active, the LSD is issuing 4 uops per cycle. We have learned from previous experiments that when the LSD issues 2 jump uops in the same cycle, it cannot issue jump uops in the next cycle due to some structural limitation, so it will stall.LSD.CYCLES_ACTIVE/cycles
shows that the LSD is stalling (almost) every other cycle. We expect that it takes about 2 cycles to execute an outer iteration, butcycles
shows that it takes about 1.8 cycles. This is probably related to the 0.9 jump uop throughput on IvB we have seen earlier.The case
IC
=1 on SnB is similar except for two things. First, an outer loop actually takes 2 cycles as expected, not 1.8. Second, all of the three LSD event counts are double what is expected. They can be adjusted as discussed in the previous section.Branch prediction is particularly interesting when
IC
>1. Let's analyze theIC
=2 case in detail.LSD.CYCLES_ACTIVE
andLSD.CYCLES_4_UOPS
show that in about 32% of all cycles, the LSD is active, and in 50% of these cycles, the LSD issues 4 uops per cycle. So there are either mispredictions or that LSD is taking a lot of time in the loop detection state or the learning state. Nonetheless,cycles
/(OC
*IC
) is about 1.6, or in other words,cycles
/jumps
is 1.07, which is close to the optimal performance. It's difficult to figure out which uops are being issued in groups of 4 from the LSD and which uops are being issued in groups of size less than 4 from the LSD. In fact, we don't know how the LSD events are counted in the presence of LSD mispredictions. Potential unrolling adds another level of complexity. The LSD event counts can be considered as upper bounds on the useful uops issued by the LSD and the cycles in which the LSD issued useful uops.As
IC
increase, bothLSD.CYCLES_ACTIVE
andLSD.CYCLES_4_UOPS
decrease and performance deteriorates slowly but consistently (remember thatcycles
/(OC
*IC
) should be compared againstOPT
). It is as if the last inner loop iteration is being mispredicted, but its misprediction penalty is increasing withIC
. Note that BPU always correctly predicts the number of inner loop iterations.The answer
I'll discuss what happens for any
IC
, why performance deteriorates for largerIC
, and what the upper and lower bounds on performance are. The following code will be used in this section:This is essentially the same as the code from the question. The only difference is that the number of outer iterations is adjusted to maintain the same number of dynamic uops. Note that
LSD.CYCLES_4_UOPS
is useless in this case because the LSD will never have 4 uops to issue in any cycle. All of the following figures are for IvB only. No worries, though, how SnB is different will be mentioned in the text.When
IC
=1,cycles
/jumps is 0.7 (1.0 on SnB), which is even lower than 0.9. I don't know how this throughput is being achieved. Performance decreases with larger values ofIC
, which correlates with the decrease in LSD active cycles. WhenIC
=13-27 (9-27 on SnB), zero uops get issued from the LSD. I think in this range, the LSD deems the performance impact due to mispredicting the last inner iteration is larger than some threshold, it decides to never lock down the loop and it remembers its decision. WhenIC
<13, the LSD appears to be aggressive and perhaps it considers the loop to be more predictable. ForIC
>27, the LSD active cycles count slowly grows and that correlates with gradual improvement in performance. Although not shown in the figure, asIC
grows far beyond 64, most of the uops will come from the LSD andcycles
/jumps settles at 0.9.The results for the range
IC
=13-27 is particularly useful. The issue stall cycles is about half of the total cycle count and is also equal to the dispatch stall cycles. It is precisely for this reason why the inner loop is executing at 2.0c/iter; because jump uops of the inner loop is being issued/dispatched every other cycle. When the LSD is not active, the uops can either come from the DSB, MITE, or the MSROM. Microcode assists are not required for our loop, so there is probably a limitation in either the DSB, MITE, or both. We can further investigate to determine where the limitations is using the frontend performance events. I've done this and the results show that about 80-90% of all uops come from the DSB. The DSB itself has many limitations and it seems that the loop is hitting one them. It seems that the DSB takes 2 cycles to deliver a jump uop that targets itself. In addition, for the fullIC
range, the stalls due to MITE-DSB switching consistute up to 9% of all cycles. Again, the reason for these switches is due to limitations in the DSB itself. Note that up 20% are being delivered from the MITE path. Assuming that the uops don't exceed the 16B/c bandwidth of the MITE path, I think the loop would have executed at 1c/iter if the DSB was not there.The above figure also shows BPU misprediction rate (per outer loop iteration). On IvB, it's zero for
IC
=1-33, except whenIC
=21, 0-1 whenIC
=34-45, and it's exactly 1 whenIC
>46. On SnB, it's zero forIC
=1-33 and 1 otherwise.