How do data caches route the object in this example?

Posted 2020-04-07 19:47

Question:

Consider the diagrammed data cache architecture. (ASCII art follows.)

  --------------------------------------
  | CPU core A | CPU core B |          |
  |------------|------------| Devices  |
  |  Cache A1  |  Cache B1  | with DMA |
  |-------------------------|          |
  |         Cache 2         |          |
  |------------------------------------|
  |                RAM                 |
  --------------------------------------

Suppose that

  • an object is shadowed on a dirty line of Cache A1,
  • an older version of the same object is shadowed on a clean line of Cache 2, and
  • the newest version of the same object has recently been written to RAM via DMA.

Diagram:

  --------------------------------------
  | CPU core A | CPU core B |          |
  |------------|------------| Devices  |
  |  (dirty)   |            | with DMA |
  |-------------------------|          |
  |     (older, clean)      |          |
  |------------------------------------|
  |          (newest, via DMA)         |
  --------------------------------------

Three questions, please.

  1. If CPU core A tries to load (read) the object, what happens?

  2. If, instead, CPU core A tries to store (write) the object, what happens?

  3. Would anything nonobvious, interesting and/or different happen if, rather than core A, core B did the loading or storing?

My questions are theoretical. My questions do not refer to any particular CPU architecture but you may refer to x86 or ARM (or even RISC-V) in your answer if you wish.

Notes. If disregarding snooping would simplify your answer then you may disregard snooping at your discretion. Alternately, you may modify the problem if a modified problem would better illuminate the topic in your opinion. If you must write code to answer, then I would prefer C/C++. You need not name specific flags of a MESI or MOESI protocol in your answer as far as I know, but a simpler, less detailed answer would probably suffice.

Motive. My motive to ask is that I am reading about concurrency and the memory model in the C++ standard. I would like to learn to visualize this model approximately in terms of hardware operations if possible.

UPDATE

To the extent to which I understand, @HadiBrais advises that the following diagrammed architecture would be more usual than the one I have earlier diagrammed, especially if DDIO (see his answer below) is implemented.

  --------------------------------------
  | CPU core A | CPU core B | Devices  |
  |------------|------------| with DMA |
  |  Cache A1  |  Cache B1  |          |
  |------------------------------------|
  |              Cache 2               |
  |------------------------------------|
  |                RAM                 |
  --------------------------------------

Answer 1:

Your hypothetical system seems to include coherent, write-back L1 caches and non-coherent DMA. A very similar real processor is ARM11 MPCore, except that it doesn't have an L2 cache. However, most modern processors do have coherent DMA. When DMA is not coherent, it is the software's responsibility to ensure coherence. The state of the system as shown in your diagram is already incoherent.

If CPU core A tries to load (read) the object, what happens?

It will just read the line held in its local L1 cache. No changes will occur.

If, instead, CPU core A tries to store (write) the object, what happens?

The line is already in the M coherence state in the L1 cache of core A, so core A can write to it directly. No coherence state changes will occur.

Would anything nonobvious, interesting and/or different happen if, rather than core A, core B did the loading or storing?

If core B issues a load request to the same line, the L1 cache of core A is snooped and the line is found in the M state. The line is updated in the L2 cache and is sent to the L1 cache of core B. In addition, one of the following will occur:

  • The line is invalidated in core A's L1 cache. The line is inserted in core B's L1 cache in the E coherence state (in case of the MESI protocol) or the S coherence state (in case of the MSI protocol). If the L2 uses a snoop filter, the filter is updated to indicate that core B has the line in the E/S state. Otherwise, the state of the line in the L2 will be the same as that in core B's L1, except that the L2 doesn't know that the line is there (so snoops will always have to be broadcast).
  • The state of the line in core A's L1 cache is changed to S. The line is inserted in core B's L1 cache in the S coherence state. The L2 inserts the line in the S state.

Either way, both L1 caches and the L2 cache will all hold the same copy of the line, which remains incoherent with that in the memory.

If core B issues a store request to the same line, the line will be invalidated in core A's cache and will end up in the M state in core B's cache.
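The core B load and store cases can be visualized with a toy model. The following is only a sketch under stated assumptions: the type and function names (`System`, `core_b_load`, `core_b_store`) are made up for illustration, the version numbers 1, 2 and 3 stand for the old, dirty and DMA-written copies, the load follows the variant in which core A keeps a shared copy, and the store overwrites the whole object so no data transfer from core A is modeled.

```cpp
#include <cassert>
#include <cstdint>

// Toy model of a single cache line in the two-core system from the question.
// Every name here is illustrative; none of this is a real hardware interface.
enum class State { I, S, E, M };  // MESI coherence states

struct Line {
    State state;
    std::uint32_t data;
};

struct System {
    Line l1a;           // Cache A1
    Line l1b;           // Cache B1
    Line l2;            // Cache 2
    std::uint32_t ram;  // main memory, written by non-coherent DMA
};

// The incoherent starting point: A1 holds dirty version 2, the L2 holds
// older clean version 1, and RAM holds version 3 written by DMA.
inline System initial_state() {
    return System{Line{State::M, 2}, Line{State::I, 0}, Line{State::S, 1}, 3};
}

// Core B load, using the variant in which core A keeps a shared copy.
inline std::uint32_t core_b_load(System& s) {
    if (s.l1b.state != State::I) return s.l1b.data;  // hit in B's own L1
    if (s.l1a.state == State::M) {                   // snoop finds a dirty line in A1
        s.l2 = Line{State::S, s.l1a.data};           // the L2 is updated from A1
        s.l1a.state = State::S;                      // A1 downgrades from M to S
    }
    assert(s.l2.state != State::I);  // an L2 miss is outside this sketch
    s.l1b = Line{State::S, s.l2.data};
    return s.l1b.data;
}

// Core B store: B takes ownership and the copy in A1 is invalidated.
inline void core_b_store(System& s, std::uint32_t value) {
    s.l1a.state = State::I;
    s.l2.state = State::I;  // assumption: the stale L2 copy is simply dropped
    s.l1b = Line{State::M, value};
}
```

Calling `core_b_load` on `initial_state()` returns version 2 and leaves A1, B1 and the L2 all in S with the same data, while `ram` still holds version 3. The caches agree with each other but not with memory.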

Eventually, the line will be evicted from the cache hierarchy to make space for other lines. When that happens, there are two cases:

  • The line is in the S/E state, so it will simply be dropped from all caches. Later, if the line is read again, the copy written by the DMA operation will be read from main memory.
  • The line is in the M state, so it will be written back to main memory and (potentially partially) overwrite the copy written by the DMA operation.
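The two eviction outcomes can be sketched as follows. This is a minimal illustration, not a description of any real cache: `evict` and `load` are hypothetical helpers, the whole object is modeled as one integer, and a refill after a miss is assumed to arrive in the E state.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative single-line model; all names are hypothetical.
enum class State { I, S, E, M };

struct Line {
    State state;
    std::uint32_t data;
};

// Evict a line: only a modified (dirty) line is written back to memory.
inline void evict(Line& line, std::uint32_t& ram) {
    if (line.state == State::M) ram = line.data;  // write-back overwrites RAM
    line.state = State::I;                        // a clean line is just dropped
}

// Load through the cache: a miss refills the line from memory.
inline std::uint32_t load(Line& line, const std::uint32_t& ram) {
    if (line.state == State::I) line = Line{State::E, ram};
    return line.data;
}
```

With RAM holding the DMA-written value 3 and the cache holding 2, evicting a clean line and reloading yields 3, whereas evicting a dirty line first overwrites RAM with 2, so the DMA data is lost.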

Obviously, such an incoherent state must never occur. It can be prevented by invalidating all relevant lines from all caches before the DMA write operation begins and ensuring that no core accesses the area of memory being written to until the operation finishes. The DMA controller sends an interrupt whenever an operation completes. In the case of a DMA read operation, all relevant dirty lines need to be written back to memory beforehand so that the device reads the most recent values.
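The software discipline described above can be sketched with a toy write-back cache in front of a RAM array. Everything here is an assumption made for illustration: `ToyMachine`, `clean_range`, `invalidate_range`, `dma_write` and `dma_read` are hypothetical names, one array element stands for one cache line, and real systems would use architecture-specific cache maintenance instructions or operating system services instead.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

constexpr std::size_t kLines = 8;  // size of the toy RAM, in cache lines

struct Entry {
    std::uint32_t data;
    bool dirty;
};

// Toy write-back cache in front of RAM that a DMA engine accesses directly.
struct ToyMachine {
    std::array<std::uint32_t, kLines> ram{};
    std::unordered_map<std::size_t, Entry> cache;

    std::uint32_t cpu_read(std::size_t i) {
        assert(i < kLines);
        auto it = cache.find(i);
        if (it == cache.end()) it = cache.emplace(i, Entry{ram[i], false}).first;
        return it->second.data;
    }

    void cpu_write(std::size_t i, std::uint32_t v) {
        assert(i < kLines);
        cache[i] = Entry{v, true};  // write-allocate and mark the line dirty
    }

    // Write dirty lines back to RAM; needed before the device reads RAM.
    void clean_range(std::size_t first, std::size_t count) {
        assert(first + count <= kLines);
        for (std::size_t i = first; i < first + count; ++i) {
            auto it = cache.find(i);
            if (it != cache.end() && it->second.dirty) {
                ram[i] = it->second.data;
                it->second.dirty = false;
            }
        }
    }

    // Drop lines without writing back; needed before the device writes RAM.
    void invalidate_range(std::size_t first, std::size_t count) {
        for (std::size_t i = first; i < first + count; ++i) cache.erase(i);
    }

    // Non-coherent DMA: the device touches RAM only, never the cache.
    void dma_write(std::size_t first, const std::vector<std::uint32_t>& src) {
        assert(first + src.size() <= kLines);
        for (std::size_t k = 0; k < src.size(); ++k) ram[first + k] = src[k];
    }

    std::vector<std::uint32_t> dma_read(std::size_t first, std::size_t count) const {
        assert(first + count <= kLines);
        return std::vector<std::uint32_t>(ram.begin() + first,
                                          ram.begin() + first + count);
    }
};
```

In this model, calling `invalidate_range` before `dma_write` makes the next `cpu_read` return the device's data, and calling `clean_range` before `dma_read` makes the device see the CPU's latest stores. Skipping either step reproduces the stale-data cases discussed above.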

Intel Data Direct I/O (DDIO) technology enables the DMA controller to read from or write to the shared last-level cache directly to improve performance.


This section is not directly related to the question, but I want to write this somewhere.

All commercial x86 CPUs are fully cache coherent (i.e., the whole cache hierarchy is coherent). To be more precise, all processors within the same shared memory domain are cache coherent. In addition, all commercial x86 manycore coprocessors (i.e., Intel Xeon Phi in the PCIe card form) are internally fully coherent. A coprocessor, which is a device on the PCIe interconnect, is not coherent with other coprocessors or CPUs. So a coprocessor resides in a separate coherence domain of its own. I think this is because there is no built-in hardware mechanism to make a PCIe device that has a cache coherent with other PCIe devices or CPUs.

Other than commercial x86 chips, there are prototype x86 chips that are not cache coherent. The only example I'm aware of is Intel's Single-Chip Cloud Computer (SCC), which later evolved into the coherent Xeon Phi.