Why jnz requires 2 cycles to complete in an inner

2020-02-14 03:45发布

问题:

I'm on an IvyBridge. I found the performance behavior of jnz inconsistent in inner loop and outer loop.

The following simple program has an inner loop with fixed size 16:

global _start
_start:
    mov rcx, 100000000
.loop_outer:
    mov rax,    16

.loop_inner:
    dec rax
    jnz .loop_inner

    dec rcx
    jnz .loop_outer

    xor edi, edi
    mov eax, 60
    syscall

perf tool shows the outer loop runs 32c/iter. It suggests the jnz requires 2 cycles to complete.

I then search in Agner's instruction table, conditional jump has 1-2 "reciprocal throughput", with a comment "fast if no jump".

At this point I start to believe the above behavior is somehow expected. But why does jnz in an outer loop only require 1 cycle to complete?

If I remove the .loop_inner part altogether, the outer loop runs 1c/iter. The behavior looks inconsistent.

What I am missing here?

Edit for more info:

The perf results for the above program with command:

perf stat -ecycles,branches,branch-misses,lsd.uops,uops_issued.any -r4 ./a.out

is:

 3,215,921,579      cycles                                                        ( +-  0.11% )  (79.83%)
 1,701,361,270      branches                                                      ( +-  0.02% )  (80.05%)
        19,212      branch-misses             #    0.00% of all branches          ( +- 17.72% )  (80.09%)
        31,052      lsd.uops                                                      ( +- 76.58% )  (80.09%)
 1,803,009,428      uops_issued.any                                               ( +-  0.08% )  (79.93%)

The perf result of the reference case:

global _start
_start:
    mov rcx, 100000000
.loop_outer:
    mov rax,    16
    dec rcx
    jnz .loop_outer

    xor edi, edi
    mov eax, 60
    syscall

is:

   100,978,250      cycles                                                        ( +-  0.66% )  (75.75%)
   100,606,742      branches                                                      ( +-  0.59% )  (75.74%)
         1,825      branch-misses             #    0.00% of all branches          ( +- 13.15% )  (81.22%)
   199,698,873      lsd.uops                                                      ( +-  0.07% )  (87.87%)
   200,300,606      uops_issued.any                                               ( +-  0.12% )  (79.42%)

So the cause is mostly clear: LSD stops working for some reason in the nested case. Reducing the inner loop size will slightly mitigate the slowness, but not completely.

Searching Intel "optimization manual", I found that LSD won't work if the loop contains "more than eight taken branches". This somehow explains the behavior.

回答1:

TL;DR: The DSB seems to be only able to deliver one jump uop of the inner loop every other cycle. Also DSB-MITE switches constitute up to 9% of execution time.


Introduction - Part 1: Understanding the LSD performance events

I'll first discuss when the LSD.UOPS and LSD.CYCLES_ACTIVE performance events occur and some peculiarities of the LSD on the IvB and SnB microarchitectures. Once we establish this foundation, we can then answer the question. To do this, we can use small pieces of code that are especially designed to determine accurately when these events occur.

According to the documentation:

LSD.UOPS: Number of Uops delivered by the LSD.
LSD.CYCLES_ACTIVE: Cycles Uops delivered by the LSD, but didn't come from the decoder.

These definitions are useful, but, as you'll see later, not precise enough to answer your question. It's important to develop a better understand of these events. Some of the information presented here is not documented by Intel and it's just my best interpretation of the empirical results and some of the related patents that I went through. Although I was not able to find the specific patent that describes the LSD implementation in SnB or later microarchitectures.

Each of the following benchmarks start with a comment that contains the name of the benchmark. All numbers are normalized per iteration, unless otherwise mentioned.

; B1
----------------------------------------------------
    mov rax, 100000000
.loop:
    dec rax
    jnz .loop
----------------------------------------------------
Metric                             |  IvB   |  SnB
----------------------------------------------------
cycles                             |  0.90  |  1.00
LSD.UOPS                           |  0.99  |  1.99
LSD.CYCLES_ACTIVE                  |  0.49  |  0.99
CYCLE_ACTIVITY.CYCLES_NO_EXECUTE   |  0.00  |  0.00
UOPS_ISSUED.STALL_CYCLES           |  0.43  |  0.50

Both instructions in the loop body are mac-fused into a single uop. There is only one execution port on IvB and SnB that can execute jump instructions. Therefore, the maximum throughput should be 1c/iter. IvB is 10% faster, though, for some reason.

According to Is performance reduced when executing loops whose uop count is not a multiple of processor width?, the LSD in IvB and SnB cannot issue uops across the loop body boundaries even if there are available issue slots. Since the loop contains a single uop, we expect that the LSD will issue a single uop per cycle and that LSD.CYCLES_ACTIVE should about equal to the total number of cycles.

On IvB, LSD.UOPS is as expected. That is, the LSD will issue one uop per cycle. Note that since the number of cycles is equal to the number of iterations which is equal to the number of uops, we can equivalently say that the LSD issues one uop per iteration. Essentially, most of the uops that were execute were issued from the LSD. However, LSD.CYCLES_ACTIVE is about half of the number of cycles. How is this possible? In this case, shouldn't only half of the total number of uops be issued from the LSD? I think what is happening here is that the loop is being essentially unrolled twice and two uops are being issued per cycle. Nonetheless, only a single uop can be executed per cycle yet RESOURCE_STALLS.RS is zero, indicating that RS never gets full. However, RESOURCE_STALLS.ANY is about half of the cycle count. Putting all of this together now, it seems that the LSD is actually issuing 2 uops every other cycle and that there is some structural limitation that is being reached every other cycle. CYCLE_ACTIVITY.CYCLES_NO_EXECUTE confirms that there is always at least one read uop in the RS at any given cycle. The following experiments will reveal the conditions for the unrolling to happen.

On SnB, LSD.UOPS shows that twice the total number of uops were issued from the LSD. Also LSD.CYCLES_ACTIVE indicates the LSD was active most of the time. CYCLE_ACTIVITY.CYCLES_NO_EXECUTE and UOPS_ISSUED.STALL_CYCLES are as on IvB. The following experiments are helpful to understand what's happening. It seems that the measured LSD.CYCLES_ACTIVE is equal to the real LSD.CYCLES_ACTIVE+RESOURCE_STALLS.ANY. Therefore, to get the real LSD.CYCLES_ACTIVE, RESOURCE_STALLS.ANY must be subtracted from the measured LSD.CYCLES_ACTIVE. The same applies to LSD.CYCLES_4_UOPS. The real LSD.UOPS can be calculated as follows:

LSD.UOPSmeasured = LSD.UOPSreal + ((LSD.UOPSmeasured/LSD.CYCLES_ACTIVEmeasured)*RESOURCE_STALLS.ANY)

Thus,

LSD.UOPSreal = LSD.UOPSmeasured - ((LSD.UOPSmeasured/LSD.CYCLES_ACTIVEmeasured) * RESOURCE_STALLS.ANY)
     = LSD.UOPSmeasured * (1 - (RESOURCE_STALLS.ANY/LSD.CYCLES_ACTIVEmeasured))

For all of the benchmarks I've run on SnB (including those not shown here), these adjustments are accurate.

Note that RESOURCE_STALLS.RS and RESOURCE_STALLS.ANY on SnB are just like IvB. So it seems that the LSD works the same way, as far as this particular benchmark is concerned, on IvB and SnB, except that the events LSD.UOPS and LSD.CYCLES_ACTIVE are counted differently.

; B2
----------------------------------------------------
    mov rax, 100000000
    mov rbx, 0
.loop:
    dec rbx
    jz .loop
    dec rax
    jnz .loop
----------------------------------------------------
Metric                             |  IvB   |  SnB
----------------------------------------------------
cycles                             |  1.98  |  2.00
LSD.UOPS                           |  1.92  |  3.99
LSD.CYCLES_ACTIVE                  |  0.94  |  1.99
CYCLE_ACTIVITY.CYCLES_NO_EXECUTE   |  0.00  |  0.00
UOPS_ISSUED.STALL_CYCLES           |  1.00  |  1.00

In B2, there are 2 uops per iteration and both are jumps. The first one is never taken, so there is still only one loop. We expect it to run at 2c/iter, which is indeed the case. LSD.UOPS shows that most uops were issued from LSD, but LSD.CYCLES_ACTIVE shows that the LSD was active only half of the time. This means that the loop was not unrolled. So it seems that unrolling only occurs when there is a single uop in the loop.

; B3
----------------------------------------------------
    mov rax, 100000000
.loop:
    dec rbx
    dec rax
    jnz .loop
----------------------------------------------------
Metric                             |  IvB   |  SnB
----------------------------------------------------
cycles                             |  0.90  |  1.00
LSD.UOPS                           |  1.99  |  1.99
LSD.CYCLES_ACTIVE                  |  0.99  |  0.99
CYCLE_ACTIVITY.CYCLES_NO_EXECUTE   |  0.00  |  0.00
UOPS_ISSUED.STALL_CYCLES           |  0.00  |  0.00

There are also 2 uops here, but the first one is a single-cycle ALU uop that is not related to the jump uop. B3 helps us answer the following two questions:

  • If the target of a jump is not a jump uop, will the LSD.UOPS and LSD.CYCLES_ACTIVE still count twice on SnB?
  • If the loop contains 2 uops where only one of them is a jump, will the LSD unroll the loop?

B3 shows that the answer to both question is a "No."

UOPS_ISSUED.STALL_CYCLES suggests that the LSD will only stall one cycle if it issues two jump uops in one cycle. This never happens in B3, so there are no stalls.

; B4
----------------------------------------------------
    mov rax, 100000000
.loop:
    add rbx, qword [buf]
    dec rax
    jnz .loop
----------------------------------------------------
Metric                             |  IvB   |  SnB
----------------------------------------------------
cycles                             |  0.90  |  1.00
LSD.UOPS                           |  1.99  |  2.00
LSD.CYCLES_ACTIVE                  |  0.99  |  1.00
CYCLE_ACTIVITY.CYCLES_NO_EXECUTE   |  0.00  |  0.00
UOPS_ISSUED.STALL_CYCLES           |  0.00  |  0.00

B4 has an additional twist to it; it contains 2 uops in the fused domain but 3 uops in the fused domain because the load-ALU instruction gets unfused in the RS. In the previous benchmarks, there were no micro-fused uops, only macro-fused uops. The goal here is to see how micro-fused uops are treated by the LSD.

LSD.UOPS shows that the two uops of the load-ALU instruction have consumed a single issue slot (the fused jump uop consumes only a single slot). Also since LSD.CYCLES_ACTIVE is equal to cycles, no unrolling has occurred. The loop throughput is as expected.

; B5
----------------------------------------------------
    mov rax, 100000000
.loop:
    jmp .next
.next:
    dec rax
    jnz .loop
----------------------------------------------------
Metric                             |  IvB   |  SnB
----------------------------------------------------
cycles                             |  2.00  |  2.00
LSD.UOPS                           |  1.91  |  3.99
LSD.CYCLES_ACTIVE                  |  0.96  |  1.99
CYCLE_ACTIVITY.CYCLES_NO_EXECUTE   |  0.00  |  0.00
UOPS_ISSUED.STALL_CYCLES           |  1.00  |  1.00

B5 is the last benchmark that we will need. It is similar to B2 in that it contains two branch uops. However, one of the jump uops in B5 is a forward unconditional jump. The results are identical to B2, indicating that it doesn't matter whether a jump uop is conditional or not. This is also case if the first jump uop is conditional and the second is not.

Introduction - Part 2: Branch prediction in the LSD

The LSD is mechanism implemented in the uop queue (IDQ) that can improve performance and reduce power consumption (consequently, heat emission is reduced). It can improve performance because some of the limitations that exist in the frontend may be relaxed in the uop queue. In particular, on SnB and IvB, both the MITE and DSB paths have a maximum throughput of 4uops/c, but in terms of bytes, it's 16B/c and 32B/c, respectively. The uop queue bandwidth is also 4uops/c, but has no limitation on the number of bytes. As long as the LSD issues uops from the uop queue, the frontend (i.e., the fetch and decode units) and even unneeded logic downstream from the IDQ can be powered down. Prior to Nehalem, the LSD was implemented in the IQ unit. Starting with Haswell, the LSD supports loops that contain uops from the MSROM. The LSD in Skylake processors is disabled because, apparently, it's buggy.

Loops usually contain at least one conditional branch. The LSD essentially monitors backward conditional branches and tries to determine a sequence of uops that constitute a loop. If the LSD takes too much time to detect a loop, performance may degrade and power may be wasted. On the other hand, if the LSD prematurely locks down a loop and attempts to replay it, the conditional jump of the loop may actually fall through. This can only be detected after executing the conditional jump, which means that later uops might have already issued and dispatched for execution. All of these uops need to be flushed and the frontend need to be activated to fetch uops from the correct path. So there can be a significant performance penalty if the performance improvement from using the LSD does not exceeds the performance degradation resulting from potentially mispredicting the last execution of the conditional branch where the loop is exited.

We already know that the branch prediction unit (BPU) on SnB and later can correctly predict when a conditional branch of a loop falls through when the total number of iterations does not exceed some small number, after which the BPU assumes that the loop will iteration forever. If the LSD uses the sophisticated capabilities of the BPU to predict when a locked down loop terminates, it should be able to correctly predict the same cases. It's possible also that the LSD uses its own branch predictor that is potentially much simpler. Let's find out.

mov rcx, 100000000/(IC+3)
.loop_outer:
    mov rax, IC
    mov rbx, 1 

.loop_inner:
    dec rax
    jnz .loop_inner

    dec rcx
    jnz .loop_outer

Let OC and IC denote the number of outer iterations and the number of inner iterations, respectively. These are related as follows:

OC = 100000000/(IC+3)       where IC > 0

For any given IC, the total number of uops retired is the same. In addition, the number of uops in the fused domain is equal to the number of uops in the unfused domain. This is nice because it really simplifies the analysis and allows us to make a fair performance comparison between different values of IC.

Compared to the code from the question, there is an additional instruction, mov rbx, 1, so that the total number of uops in the outer loop is exactly 4 uops. This enables us to make use of the LSD.CYCLES_4_UOPS performance event in addition to LSD.CYCLES_ACTIVE and BR_MISP_RETIRED.CONDITIONAL. Note that since there is only a single branch execution port, each outer loop iteration takes at least 2 cycles (or according to Agner's table, 1-2 cycles). See also: Can the LSD issue uOPs from the next iteration of the detected loop?.

The total number of jump uops is:

OC + IC*OC = 100M/(IC+3) + IC*100M/(IC+3)
     = 100M(IC+1)/(IC+3)

Assuming that the maximum jump uop throughput is 1 per cycle, the optimal execution time is 100M(IC+1)/(IC+3) cycles. On IvB, we can instead use a maximum jump uop throughput of 0.9/c if we want to be strict. It would be useful to divide this by the number of inner iterations:

OPT = (100M(IC+1)/(IC+3)) / (100MIC/(IC+3)) =
    100M(IC+1) * (IC+3) / (IC+3) * 100MIC =
    (IC+1)/IC = 1 + 1/IC

Hence, 1 < OPT <= 1.5 for IC > 1. The person designing the LSD can use this to compare different designs of the LSD. We'll use this shortly also. Putting it another way, the optimal performance is achieved when the total number of cycles divided by the total number of jumps is 1 (or 0.9 on IvB).

Assuming that the prediction for the two jumps are independent and given that jnz .loop_outer is easily predictable, the performance depends on the prediction of jnz .loop_inner. On a misprediction that changes control to a uop outside of the locked down loop, the LSD terminates the loop and tries to detect another loop. The LSD can be represented as a state machine with three states. In one state, the LSD is looking for a looping behavior. In the second state, the LSD is learning the boundaries and the number of iterations of the loop. In the third state, the LSD is replaying the loop. When the loop exists, the state changes from the third to the first.

As we have learned from the previous set of experiments, there will be extra LSD events on SnB when there are backend-related issue stalls. So the numbers need to be understood accordingly. Note that the case where IC=1 has not been tested in the previous section. It will be discussed here. Recall also that, on both IvB and SnB, the inner loop may get unrolled. The outer loop will never get unrolled because it contains more than one uop. By the way, LSD.CYCLES_4_UOPS works as expected (sorry, no surprises there).

The following figures show the raw results. I've only shown the results up to IC=13 and IC=9 on IvB and SnB, respectively. I'll discuss in the next section what happens for larger values. Note that when the denominator is zero, the value cannot be computed and so it's not plotted.

LSD.UOPS/100M is the ratio of the number of uops issued from the LSD to the total number of uops. LSD.UOPS/OC is the average number of uops issued from the LSD per outer iteration. LSD.UOPS/(OC*IC) is the average number of uops issued from the LSD per inner iteration. BR_MISP_RETIRED.CONDITIONAL/OC is the average number of retired conditional branches that were mispredicted per outer iteration, which is clearly zero on both IvB and SnB for all IC.

For IC=1 on IvB, all uops were issued from the LSD. The inner conditional branch is always not taken. The LSD.CYCLES_4_UOPS/LSD.CYCLES_ACTIVE metric shown in the second figure shows that in all of the cycles in which the LSD is active, the LSD is issuing 4 uops per cycle. We have learned from previous experiments that when the LSD issues 2 jump uops in the same cycle, it cannot issue jump uops in the next cycle due to some structural limitation, so it will stall. LSD.CYCLES_ACTIVE/cycles shows that the LSD is stalling (almost) every other cycle. We expect that it takes about 2 cycles to execute an outer iteration, but cycles shows that it takes about 1.8 cycles. This is probably related to the 0.9 jump uop throughput on IvB we have seen earlier.

The case IC=1 on SnB is similar except for two things. First, an outer loop actually takes 2 cycles as expected, not 1.8. Second, all of the three LSD event counts are double what is expected. They can be adjusted as discussed in the previous section.

Branch prediction is particularly interesting when IC>1. Let's analyze the IC=2 case in detail. LSD.CYCLES_ACTIVE and LSD.CYCLES_4_UOPS show that in about 32% of all cycles, the LSD is active, and in 50% of these cycles, the LSD issues 4 uops per cycle. So there are either mispredictions or that LSD is taking a lot of time in the loop detection state or the learning state. Nonetheless, cycles/(OC*IC) is about 1.6, or in other words, cycles/jumps is 1.07, which is close to the optimal performance. It's difficult to figure out which uops are being issued in groups of 4 from the LSD and which uops are being issued in groups of size less than 4 from the LSD. In fact, we don't know how the LSD events are counted in the presence of LSD mispredictions. Potential unrolling adds another level of complexity. The LSD event counts can be considered as upper bounds on the useful uops issued by the LSD and the cycles in which the LSD issued useful uops.

As IC increase, both LSD.CYCLES_ACTIVE and LSD.CYCLES_4_UOPS decrease and performance deteriorates slowly but consistently (remember that cycles/(OC*IC) should be compared against OPT). It is as if the last inner loop iteration is being mispredicted, but its misprediction penalty is increasing with IC. Note that BPU always correctly predicts the number of inner loop iterations.


The answer

I'll discuss what happens for any IC, why performance deteriorates for larger IC, and what the upper and lower bounds on performance are. The following code will be used in this section:

mov rcx, 100000000/(IC+2)
.loop_outer:
    mov rax, IC

.loop_inner:
    dec rax
    jnz .loop_inner

    dec rcx
    jnz .loop_outer

This is essentially the same as the code from the question. The only difference is that the number of outer iterations is adjusted to maintain the same number of dynamic uops. Note thatLSD.CYCLES_4_UOPS is useless in this case because the LSD will never have 4 uops to issue in any cycle. All of the following figures are for IvB only. No worries, though, how SnB is different will be mentioned in the text.

When IC=1, cycles/jumps is 0.7 (1.0 on SnB), which is even lower than 0.9. I don't know how this throughput is being achieved. Performance decreases with larger values of IC, which correlates with the decrease in LSD active cycles. When IC=13-27 (9-27 on SnB), zero uops get issued from the LSD. I think in this range, the LSD deems the performance impact due to mispredicting the last inner iteration is larger than some threshold, it decides to never lock down the loop and it remembers its decision. When IC<13, the LSD appears to be aggressive and perhaps it considers the loop to be more predictable. For IC>27, the LSD active cycles count slowly grows and that correlates with gradual improvement in performance. Although not shown in the figure, as IC grows far beyond 64, most of the uops will come from the LSD and cycles/jumps settles at 0.9.

The results for the range IC=13-27 is particularly useful. The issue stall cycles is about half of the total cycle count and is also equal to the dispatch stall cycles. It is precisely for this reason why the inner loop is executing at 2.0c/iter; because jump uops of the inner loop is being issued/dispatched every other cycle. When the LSD is not active, the uops can either come from the DSB, MITE, or the MSROM. Microcode assists are not required for our loop, so there is probably a limitation in either the DSB, MITE, or both. We can further investigate to determine where the limitations is using the frontend performance events. I've done this and the results show that about 80-90% of all uops come from the DSB. The DSB itself has many limitations and it seems that the loop is hitting one them. It seems that the DSB takes 2 cycles to deliver a jump uop that targets itself. In addition, for the full IC range, the stalls due to MITE-DSB switching consistute up to 9% of all cycles. Again, the reason for these switches is due to limitations in the DSB itself. Note that up 20% are being delivered from the MITE path. Assuming that the uops don't exceed the 16B/c bandwidth of the MITE path, I think the loop would have executed at 1c/iter if the DSB was not there.

The above figure also shows BPU misprediction rate (per outer loop iteration). On IvB, it's zero for IC=1-33, except when IC=21, 0-1 when IC=34-45, and it's exactly 1 when IC>46. On SnB, it's zero for IC=1-33 and 1 otherwise.



回答2:

(Partial answer / speculation I didn't finish writing before Hadi posted a detailed analysis; some of this is continuing from comments)

Agner's statement "the loop buffer has no measurable effect in the cases where the uop cache is not a bottleneck..." is wrong? Because this is certainly a measurable effect and the uop cache isn't bottleneck because the cache has ~1.5K capacity.

Yes, Agner calls it the loopback buffer. His statement is that adding the LSD to the design doesn't speed up any code. But yes, it appears to be wrong for very tight loops, at least for nested loops. Apparently SnB/IvB do need the loop buffer to issue or execute 1c/iter loops. Unless the microarchitectural bottleneck is in fetching uops from the uop cache after branching, in which case his caveat covers this.

There are cases other than uop-cache misses where reading the uop cache can be a bottleneck. e.g. if uops didn't pack very well because of alignment effects, or if they use large immediates and/or displacements that take extra cycles to read from the uop cache. See the Sandybridge section of Agner Fog's uarch guide for more details about these effects. Your assumption that capacity (up to 1.5k uops if they pack perfectly) is the only reason it could be slow is very wrong.

BTW, a microcode update for Skylake disabled the LSD entirely to fix a partial-register merging bug, erratum SKL1501, and that did in fact have little effect except when a tiny loop spans a 32B boundary and needs 2 cache lines.

But Agner lists JMP rel8/32 and taken JCC throughput as 1-2 cycles on HSW/SKL, vs. just 2 on IvB. So something about taken branches may have sped up since IvB other than the LSD itself.

There might be some part(s) of the CPU other than LSD which also has a special case for long-running tiny loops that lets them run 1 taken jump per clock, on Haswell and later. I haven't tested what conditions cause 1 vs. 2 cycle taken branch throughput on HSW/SKL. Also note that Agner measured before the microcode update for erratum SKL150.


footnote 1: See How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent, and note that SKX and Kaby Lake ship with microcode that already includes this. It's finally re-enabled in Coffee Lake, which fixed the buggy hard-wired logic so the LSD can safely be enabled again. https://en.wikichip.org/wiki/intel/microarchitectures/coffee_lake#Key_changes_from_Kaby_Lake