Dependent loads reordering in CPU

2019-01-19 18:25发布

问题:

I have been reading Memory Barriers: A Hardware View For Software Hackers, a very popular article by Paul E. McKenney.

One of the things the paper highlights is that, very weakly ordered processors like Alpha, can reorder dependent loads which seems to be a side effect of partitioned cache

Snippet from the paper:

1 struct el *insert(long key, long data)
2 {
3     struct el *p;
4     p = kmalloc(sizeof(*p), GPF_ATOMIC);
5     spin_lock(&mutex);
6     p->next = head.next;
7     p->key = key;
8     p->data = data; 
9     smp_wmb();
10    head.next = p;
11    spin_unlock(&mutex);
12 }
13
14 struct el *search(long key)
15 {
16     struct el *p;
17     p = head.next;
18     while (p != &head) {
19         /* BUG ON ALPHA!!! */
20         if (p->key == key) {
21             return (p);
22         }
23         p = p->next;
24     };
25     return (NULL);
26 }
  1. There are 2 processors CPU0 and CPU1.
  2. Each CPU has 2 cache banks CB0( odd address ), CB1( even address ).
  3. Head is in CB0 and P in CB1.
  4. The insert() has a write barrier which ensure that invalidation for line 6-8 is the in bus first followed by invalidation at line 10.
  5. However, the other processor executing search can have CB0 lightly loaded and CB1 heavily loaded.
  6. This means the processor leads the latest value of head but old value of p ( because the invalidation request for p is not processed by CB1 yet. )

Question: Looks like all architectures expect Alpha honor dependent loads. For example: IA64 can reorder the following except Dependent loads reordering.

  1. Load reordered after load
  2. Load reordered after store
  3. Stores reordered after stores
  4. Stores reordered after load
  5. Atomic instruction reordered with loads.
  6. Atomic Instructions reordered with stores.

This makes me wonder what hardware support is required to prevent dependent load reordering.

One possible answer is that all other architecture( IA64) do not have a partitioned cache and hence would not run into this issue and no explicit hardware support is required.

Any insights ?

回答1:

Short answer:

In an out-of-order processor the load-store queue is used to track and enforce memory ordering constraints. Processors such as the Alpha 21264 have the necessary hardware to prevent dependent load reordering, but enforcing this dependency could add overhead for inter-processor communication.

Long answer:

Background on dependence tracking

This is probably best explained using an example. Imagine that you had the following sequence of instructions (pseudo-code instructions used for simplicity):

ST R1, A       // store value in register R1 to memory at address A
LD B, R2       // load value from memory at address B to register R2
ADD R2, 1, R2  // add immediate value 1 to R2 and save result in R2

In this example there is a dependency between the LD and the ADD instruction. The ADD reads the value of R2 and so it cannot execute until the LD makes that value available. This dependency is through a register and it is something that the processor's issue logic can track.

However, there could also be a dependency between the ST and the LD, if address A and B were the same. But unlike the dependence between the LD and the ADD, the possible dependence between the ST and the LD is not known at the time the instruction is issued (begins execution).

Instead of trying to detect memory dependencies at issue time, the processor keeps track of them using a structure called the load-store queue. What this structure does is keep track of the addresses of pending loads and stores for instructions that have been issued but not yet retired. If there is a memory ordering violation this can be detected and execution can be restarted from the point where the violation occurred.

So going back to the pseudo-code example, you could imagine a situation where the LD is executed before the ST (perhaps the value needed in R1 wasn't ready for some reason). But when the ST executes it sees that address A and B are the same. So the LD should really have read the value that was produced by the ST, rather than the stale value that was already in the cache. As a result the LD will need to be re-executed, along with any instructions that came after the LD. There are various optimizations possible to reduce some of this overhead, but the basic idea holds.

As I mentioned earlier the logic to detect this dependence exists in all out-of-order processors that allow speculative execution of memory instructions (including Alpha processors).

Memory ordering rules

However, memory ordering rules don't just constrain the order that a processor sees results from its own memory operations. Instead memory ordering rules constrain the relative order of that operations memory operations performed on one processor become visible to other processors.

Alpha example

In the case of dependent load reordering, the processor has to track this information for its own use, but Alpha ISA does not require it to make sure that other processors see this ordering. One example of how this can occur is the following (I've quoted from this link)

Initially: p = & x, x = 1, y = 0

    Thread 1         Thread 2
--------------------------------
  y = 1         |    
  memoryBarrier |    i = *p
  p = & y       |
--------------------------------
Can result in: i = 0

The anomalous behavior is currently only possible on a 21264-based system. And obviously you have to be using one of our multiprocessor servers. Finally, the chances that you actually see it are very low, yet it is possible.

Here is what has to happen for this behavior to show up. Assume T1 runs on P1 and T2 on P2. P2 has to be caching location y with value 0. P1 does y=1 which causes an "invalidate y" to be sent to P2. This invalidate goes into the incoming "probe queue" of P2; as you will see, the problem arises because this invalidate could theoretically sit in the probe queue without doing an MB on P2. The invalidate is acknowledged right away at this point (i.e., you don't wait for it to actually invalidate the copy in P2's cache before sending the acknowledgment). Therefore, P1 can go through its MB. And it proceeds to do the write to p. Now P2 proceeds to read p. The reply for read p is allowed to bypass the probe queue on P2 on its incoming path (this allows replies/data to get back to the 21264 quickly without needing to wait for previous incoming probes to be serviced). Now, P2 can derefence P to read the old value of y that is sitting in its cache (the inval y in P2's probe queue is still sitting there).

How does an MB on P2 fix this? The 21264 flushes its incoming probe queue (i.e., services any pending messages in there) at every MB. Hence, after the read of P, you do an MB which pulls in the inval to y for sure. And you can no longer see the old cached value for y.

Even though the above scenario is theoretically possible, the chances of observing a problem due to it are extremely minute. The reason is that even if you setup the caching properly, P2 will likely have ample opportunity to service the messages (i.e., inval) in its probe queue before it receives the data reply for "read p". Nonetheless, if you get into a situation where you have placed many things in P2's probe queue ahead of the inval to y, then it is possible that the reply to p comes back and bypasses this inval. It would be difficult for you to set up the scenario though and actually observe the anomaly.

The above addresses how current Alpha's may violate what you have shown. Future Alpha's can violate it due to other optimizations. One interesting optimization is value prediction.

Summary

The basic hardware needed to enforce the ordering of dependent loads is already present in all out-of-order processors. But ensuring that this memory ordering is seen by all processors adds additional constraints to handling of cache-line invalidation. And it may add additional constraints in other scenarios as well. However, in practice it seems likely that the potential advantages of the weak Alpha memory model for hardware designers were not worth the cost in software complexity and added overhead of requiring more memory barriers.