The description of the RESOURCE_STALLS.RS
hardware performance event for Intel Broadwell is the following:
This event counts stall cycles caused by absence of eligible entries
in the reservation station (RS). This may result from RS overflow, or
from RS deallocation because of the RS array Write Port allocation
scheme (each RS entry has two write ports instead of four. As a
result, empty entries could not be used, although RS is not really
full). This counts cycles that the pipeline backend blocked uop
delivery from the front end.
This basically says that there are two situations where the RS stall event occurs:
- When all of the eligible entries of the RS are occupied and the allocator is not stalled.
- When "RS deallocation" occurs because there are only two write ports, and the allocator is not stalled.
What does "eligible" mean in the first situation? Does this mean that not all entries can be occupied by all kinds of uops? Because my understanding is that in modern microarchitectures any entry can be used by any kind of uop. Also what is RS array Write Port allocation scheme and how does it cause RS stalls even when not all entries are occupied? Does this mean that there were four write ports in Haswell but now there are only two in Broadwell? Do either of these two situations apply to Skylake or Haswell even though the manual does not explicitly say so?
I've written a program that can be used to explore undocumented limitations of the RS in Intel processors in the hope that I'll be able to eventually answer the question. The basic idea is to make sure that the RS is completely empty before allocating and executing a specific sequence of uops in a loop. The RESOURCE_STALLS.RS
can be used to determine whether that sequence has hit a limitation in the RS itself. For example, if RESOURCE_STALLS.RS
is 1 per iteration, then the allocator had to stall for one cycle to allocate RS entries for all uops in the sequence. If RESOURCE_STALLS.RS
is much smaller than 1 per iteration, then it basically did not have to stall and so we know we didn't not hit any of the RS limitations.
I've experimented with a sequence of dependent ADD
instructions, a sequence of dependent BSWAP instructions, a sequence of dependent load instructions to the same location, a sequence of backward or forward unconditional jump instructions, and a sequence of store instructions to the same location. The following two graphs show the results for the sequence of add
instructions for different target RS occupancies (the maximum number of RS entries that will be simultaneously required and occupied by the sequence of uops). All the values are shown per iteration.
The following graph shows that RESOURCE_STALLS.RS
per iteration becomes at least (or anywhere near) 1 cycle per iteration when the RS occupancy is 50. Although it's not clearly visible, RESOURCE_STALLS.RS
becomes larger than zero when the RS occupancy exceeds 43, but only exceeds 1 when the RS occupancy exceeds 49. In other words, I'm only able to simultaneously use up to 49 RS entries out of the 60 (in Haswell) without RS stalls. After that, RESOURCE_STALLS.RS
increases on average by 1 per additional uop in the sequence, which is consistent with the bursty behavior of the allocator and the fact that each ADD
uop can be completed every cycle (each uop occupies an RS entry for 1 cycle only). cycles
increases on average by 2.3 per additional uop. It's larger than 1 per additional uop because there are also additional stalls on the ROB for reasons not related to the add
uops, but these are OK because they do not impact RESOURCE_STALLS.RS
.
The following graph shows the change in cycles
and RESOURCE_STALLS.RS
per iteration. It illustrates the strong correlation between execution time and RS stalls.
When the target RS occupancy is between 44-49, RESOURCE_STALLS.RS
is very small but still not really zero. I have also noticed that the exact order in which different uops are presented to the allocator slightly impacts the RS occupancy that can be reached. I think this is an effect of the RS array write port allocation scheme mentioned in the Intel manual.
So what's up with the other 11 RS entries (Haswell's RS is supposed to have 60 entries)? The RESOURCE_STALLS.ANY
performance event is the key to answer the question. I've updated the code I'm using to perform these experiments to test different kinds of loads:
- Loads that can be dispatched with speculative addresses to achieve 4 cycle L1D hit latency. This case is referred to as
loadspec
.
- Loads that cannot be dispatched with speculative addresses. These have an L1D hit latency of 5 cycles on Haswell. This case is referred to as
loadnonspec
.
- Loads that can be dispatched with speculative but incorrect addresses. These have an L1D hit latency of 9 cycles on Haswell. This case is referred to as
loadspecreplay
.
I followed the same approach with the ADD
instructions, but this time we need to watch RESOURCE_STALLS.ANY
instead of RESOURCE_STALLS.RS
(which actually doesn't capture RS stalls due to loads). The following graph shows the change in cycles
and RESOURCE_STALLS.ANY
per iteration. The first spike indicates the the target RS occupancy has exceeded the available RS entries for that kind of uop. We can clearly see that for the loadspec
case, there are exactly 11 RS entries for load uops! When the target RS occupancy exceeds 11, it takes on average 3.75 cycles for an RS entry to become free to the next load uop. This means that uops are deallocated from the RS when they complete, not when they get dispatched. This also explains how uop replay works. The spike for loadspecreplay
occurs at RS occupancy 6. The spike for loadnonspec
occurs at RS occupancy 9. As you will see later, these 11 entries are not dedicated for loads. Some of the 11 entries used by the loads may be among the 49 entries used by the ADD
uops.
I've also developed two test cases for stores: one that hits the limit of the store buffer and the other hits the limit of the RS. The graph above shows the former case. Note that a store needs two entries in the RS so the cases where the target RS occupancy is odd are the same as the previous even RS occupancies (change is zero). The graph shows that there can be up to 44/2 = 22 stores in the RS simultaneously. (The code I used to make the store graph had a bug in it that would make the achieved RS occupancy larger than what it is. After fixing it, the results show that there can be up to 20 stores in the RS simultaneously.) An entry occupied by a store-address or a store-data uop can be freed in one cycle. Intel says that the Haswell's store buffer has 42 entries, but I was not able to use all of these entries simultaneously. I'll probably have to design a different experiment to achieve that.
The jump sequences did not cause any stalls. I think this can be explained as follows: a jump uop frees the RS entry it occupies in one cycle and the allocator does not behave in a bursty way when it allocates jump uops. That is, every cycle one RS entry becomes free and the allocator will just allocate one jump uop without stalling. So we end up never stalling no matter how many jump uops there are. This is in contrast to add uops where the bursty allocator behavior makes it stall until the required number of RS entries become free (4 entries) even though the latency of an add uop is also one cycle. It makes sense that the jumps are allocated as soon as possible so that any mispredictions can be detected as early as possible. So if the allocator saw a jump and there is enough space in the RS for it but not later uops in its 4 uop group, then it would still allocate it. Otherwise, it might have to wait for potentially many cycles which can significantly delay the detection of mispredictions. This can be very costly
Is there an instruction whose uops can occupy all of the 60 entries of the RS simultaneously? Yes, one example is BSWAP
. It requires two RS entries for its two uops and I can clearly see using RESOURCE_STALLS.RS
that its uops can use all of the 60 entries of the RS simultaneously (assuming that my calculations are correct as to how the RS occupancy grows using the instruction). This proves that indeed there are exactly 60 entries in the RS. But there are constraints as to how they are used that we still don't know much about.
Yes, it is possible for RESOURCE_STALLS
to indicate a full RS before the RS is completely full.
As the RS becomes full, allocation of new uops into the RS becomes less ideal until at some point it may stall out entirely, even though some entries remain.
Furthermore, not all RS entries are available to all instructions. For example, on Haswell, I observe that only 30-32 of the 60 RS entries are available to loads: these entries may be special in they support uop replay, for example. On Skylake, the situation is different: the entire RS is not available to any type of instruction: rather, the "97 entry" RS is actually made up of a 64-entry RS for ALU ops, and a 33 entry RS for load ops. So the entire 97 entries of RS(es) will rarely be full, unless by some coincidence both fill up at exactly the same moment.
The RESOURCE_STALLS.RS
event (umask 0x4
) only triggers when a the "ALU" part of the RS is full (or full enough that an op can't allocate). For the load RS (which overlaps with the ALU RS in Haswell but not Skylake), the corresponding event has umask 0x40
. You can use it with perf
as 'cpu/event=0xa2,umask=0x40,name=resource_stalls_memrs_full/
. Although the events are not documented for Skylake, they seem to work fine (although events with umasks 0x10
through 0x80
are very different than documented on Sandy Bridge.
Future Intel chips are likely to have even finer-grained reservation stations.