What is the difference between a store queue and a store buffer?

Posted 2019-04-27 13:44

Question:

I am reading a number of papers and they are either using store buffer and store queue interchangeably or they are relating to different structures, and I just cannot follow along. This is what I thought a store queue was:

  • It is an associatively searchable FIFO queue that keeps information about store instructions in fetch order.
  • It keeps store addresses and data.
  • It keeps store instructions' data until the instructions become non-speculative, i.e. they reach retirement stage. Data of a store instruction is sent to the memory (L1 cache in this case) from the store queue only when it reaches retirement stage. This is important since we do not want speculative store data to be written to the memory, because it would mess with the in-order memory state, and we would not be able to fix the memory state in case of a misprediction.
  • Upon a misprediction, the information in the store queue corresponding to store instructions fetched after the mispredicted instruction is removed.
  • Load instructions send a read request to both L1 cache and the store queue. If data with the same address is found in the store queue, it is forwarded to the load instruction. Otherwise, data fetched from L1 is used.
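The behavior in these bullets can be sketched in a few lines of code. This is a toy model of my own (the class and method names are mine, not from any paper), which ignores partial overlaps, cache-line granularity, and timing:

```python
from collections import deque

class StoreQueue:
    """Toy store queue: FIFO of (seq, addr, data) in fetch order."""
    def __init__(self):
        self.entries = deque()  # oldest store at the left
        self.memory = {}        # stands in for the L1 cache

    def insert(self, seq, addr, data):
        self.entries.append((seq, addr, data))

    def forward(self, load_seq, addr):
        """Associative search: youngest store older than the load, same address."""
        for seq, a, data in reversed(self.entries):
            if seq < load_seq and a == addr:
                return data           # store-to-load forwarding
        return self.memory.get(addr)  # otherwise use the data fetched from L1

    def retire_oldest(self):
        """At retirement, the store's data is released to memory (L1)."""
        seq, addr, data = self.entries.popleft()
        self.memory[addr] = data

    def squash_after(self, seq):
        """Misprediction: drop stores fetched after the mispredicted instruction."""
        while self.entries and self.entries[-1][0] > seq:
            self.entries.pop()
```

For example, with stores at sequence numbers 1 and 3 to the same address, a load at sequence 2 forwards from store 1, while a load at sequence 4 forwards from store 3.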

I am not sure what a store buffer is, but I was thinking it was just some buffer space to keep data of retired store instructions waiting to be written to the memory (again, L1).

Now, here is why I am getting confused. In this paper, it is stated that "we propose the scalable store buffer [SSB], which places private/speculative values directly into the L1 cache, thereby eliminating the non-scalable associative search of conventional store buffers." I am thinking that the non-scalable associatively searchable conventional structure they are talking about is what I know as a store queue, because they also say that

SSB eliminates the non-scalable associative search of conventional store buffers by forwarding processor-visible/speculative values to loads directly from the L1 cache.

As I mentioned above, as far as I know data forwarding to loads is done through store queue. In the footnote on the first page, it is also stated that

We use "store queue" to refer to storage that holds stores’ values prior to retirement and "store buffer" to refer to storage containing retired store values prior to their release to memory.

This is in line with what I explained above, but then it conflicts with the 'store buffer' in the first quote. The footnote corresponds to one of the references in the paper. In that reference, they say

a store buffer is a mechanism that exists in many current processors to accomplish one or more of the following: store access ordering, latency hiding and data forwarding.

Again, I thought the mechanism accomplishing those is called a store queue. In the same paper they later say

non-blocking caches and buffering structures such as write buffers, store buffers, store queues, and load queues are typically employed.

So, they mention store buffer and store queue separately, but store queue is not mentioned again later. They say

the store buffer maintains the ordering of the stores and allows stores to be performed only after all previous instructions have been completed

and their store buffer model is the same as Mike Johnson's model. In Johnson's book (Superscalar Microprocessor Design), stores first go to a store reservation station in fetch order. From there, they are sent to the address unit, and from the address unit they are written into a "store buffer" along with their corresponding data. Load forwarding is handled through this store buffer. Once again, I thought this structure was called a store queue. In reference #2, the authors also mention that

The Alpha 21264 microprocessor has a 32-entry speculative store buffer where a store remains until it is retired.

I looked at a paper about Alpha 21264, which states that

Stores first transfer their data across the data buses into the speculative store buffer. Store data remains in the speculative store buffer until the stores retire. Once they retire, the data is written into the data cache on idle cache cycles.

Also,

The internal memory system maintains a 32-entry load queue (LDQ) and a 32-entry store queue (STQ) that manages the references while they are in-flight. [...] Stores exit the STQ in fetch order after they retire and dump into the data cache. [...] The STQ CAM logic controls the speculative data buffer. It enables the bypass of speculative store data to loads when a younger load issues after an older store.

So, it sounds like in Alpha 21264 there is a store queue that keeps some information about store instructions in fetch order, but it does not keep data of store instructions. Store instructions' data are kept in the store buffer.

So, after all of this I am not sure what a store buffer is. Is it just an auxiliary structure for a store queue, or is it a completely different structure that holds data waiting to be written to L1? Or is it something else? I feel like some authors mean "store queue" when they say "store buffer". Any ideas?

Answer 1:

You seem to be making a big deal out of names; it's not that critical. A buffer is just some generic storage, which in this particular case should be managed as a queue (to maintain program order, as you stated). So it could be called a store buffer (the term I'm more familiar with), but in other designs it may be described as a store queue (some designs combine it with the load queue, forming an LSQ).

The names don't matter that much because, as you see in your second quote, people may overload them to describe new things. In this particular case, the authors chose to split the store buffer into two parts, divided by the retirement pointer, because they believe this lets them avoid certain store-related stalls in some consistency models. Hey, it's their paper; for the remainder of it they get to define the terms as they want.

One note, though: the last bullet of your description of the store buffer/queue seems very architecture-specific. Forwarding local stores to loads at the highest priority may miss later stores to the same address from other threads, and would break most memory ordering models except the most relaxed ones (unless you protect against that some other way).



Answer 2:

Your initial understanding is correct - Store Buffer and Store Queue are distinct terms and distinct hardware structures with different uses. If some authors use them interchangeably, it is plain incorrect.

Store Buffer:

A store buffer is a hardware structure closer to the memory hierarchy that "buffers" up the write traffic (stores) from the processor so that the write-back stage of the pipeline can complete as soon as possible.

Depending on whether the cache is write-allocate or write-no-allocate, a write to the cache may take a variable number of cycles. The store buffer essentially decouples the processor pipeline from the memory pipeline.
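This decoupling can be illustrated with a toy sketch (my own naming, not any real design): the core enqueues stores into a bounded buffer in constant time and stalls only when the buffer is full, while the memory side drains entries into the cache on idle cycles.

```python
from collections import deque

class BufferedWriteSide:
    """Toy bounded store buffer between the core and the cache."""
    def __init__(self, capacity):
        self.buf = deque()
        self.capacity = capacity
        self.cache = {}   # stands in for the L1

    def cpu_store(self, addr, data):
        """Called by the write-back stage; returns False if it must stall."""
        if len(self.buf) >= self.capacity:
            return False            # buffer full: pipeline stalls this cycle
        self.buf.append((addr, data))
        return True                 # write-back completes immediately

    def drain_one(self):
        """Called by the memory side on an idle cache cycle."""
        if self.buf:
            addr, data = self.buf.popleft()
            self.cache[addr] = data
```

The core only observes the cache's variable write latency indirectly, through back-pressure when the buffer fills up.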

Another use of the store buffer is in speculative execution. When the processor speculatively executes and commits (as in hardware transactional memory systems, or in designs like Transmeta's Crusoe), the hardware must keep track of the speculative writes and undo them in case of misspeculation. This is what such a processor would use the store buffer for.

Store Queue:

A store queue is an associative array in which the processor keeps the data and addresses of in-flight stores. It is typically used in out-of-order processors for memory disambiguation. The processor really needs a load-store queue (LSQ) to perform memory disambiguation, because it must see all in-flight memory accesses to the same address before it can decide to schedule one memory operation before another.

All the memory disambiguation logic is accomplished via the load-store queues in an out-of-order processor.
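As a minimal illustration of why disambiguation needs visibility into all in-flight store addresses, here is a conservative scheduling check of my own devising (real designs add prediction and speculation on top of this):

```python
def load_may_issue(load_seq, load_addr, inflight_stores):
    """Conservative disambiguation: a load may bypass older stores only if
    every older store's address is resolved and does not alias the load's.
    inflight_stores: iterable of (seq, addr); addr is None if unresolved."""
    for seq, addr in inflight_stores:
        if seq < load_seq:
            if addr is None or addr == load_addr:
                return False   # possible conflict: hold the load back
    return True
```

An unresolved older store blocks the load outright, because until its address computes, the hardware cannot prove the two accesses are independent.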

If your confusion is solely because of the paper you are referring to, consider asking the authors; it is likely that their use of the terminology is mixed up.



Answer 3:

This is in line with what I explained above, but then it conflicts with the 'store buffer' in the first quote.

There is really no conflict, and your understanding seems to be consistent with the way these terms are used in the paper. Let's carefully go through what the authors have said.

SSB eliminates the non-scalable associative search of conventional store buffers...

The store buffer holds stores that have been retired but are yet to be written into the L1 cache. This necessarily implies that any later issued load is younger in program order with respect to any of the stores in the store buffer. So to check whether the most recent value of the target cache line of the load is still in the store buffer, all that needs to be done is to search the store buffer by the load address. There can be either zero stores that match the load or exactly one store. That is, there cannot be more than one matching store. In the store buffer, for the purpose of forwarding, you only need to keep track of the last store to a cache line (if any) and only compare against that. This is in contrast to the store queue as I will discuss shortly.
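Because every store sitting in the (post-retirement) store buffer is older than any newly issued load, forwarding reduces to remembering only the last store to each cache line, with no age comparison at all. A sketch under that assumption (names are mine):

```python
class StoreBufferForwarding:
    """All buffered stores are older than any new load, so forwarding only
    needs the *last* store to each cache line: one lookup, no age check."""
    def __init__(self):
        self.last_store = {}   # cache line -> data of most recent retired store

    def store(self, line, data):
        self.last_store[line] = data        # overwrites any earlier store to the line

    def forward(self, line):
        return self.last_store.get(line)    # None means: read the L1 instead
```

Contrast this with the store queue, where the associative search must also compare each store's age against the load's, because in-flight stores may be younger than the load.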

...by forwarding processor-visible/speculative values to loads directly from the L1 cache.

In the architecture proposed by the authors, the store buffer and the L1 cache are not in the coherence domain. The L2 is the first structure that is in the coherence domain. Therefore, the L1 contains private values and the authors use it to forward data.

We use "store queue" to refer to storage that holds stores’ values prior to retirement and "store buffer" to refer to storage containing retired store values prior to their release to memory.

Since the store queue holds stores that have not yet been retired, when comparing a load with the store queue, both the address and the age of each store in the queue need to be checked. Then the value is forwarded from the youngest store that is older than the load targeting the same location.

The goal of the paper you cited is to find an efficient way to increase the capacity of the store buffer. It just doesn't make any changes to the store queue because that is not in the scope of the work. However, there is another paper that targets the store queue instead.

a store buffer is a mechanism that exists in many current processors to accomplish one or more of the following: store access ordering, latency hiding and data forwarding.

These features apply to both store buffers and store queues. Using store buffers (and queues) is the most common way to provide these features, but there are others.


In general, though, these terms might be used by different authors or vendors to refer to different things. For example, in the Intel manual, only the store buffer term is used and it holds both non-retired and retired-but-yet-to-be-committed stores (obviously the implementation is much more complicated than just a buffer). In fact, it's possible to have a single buffer for both kinds of stores and use a flag to distinguish between them. In the AMD manual, the terms store buffer, store queue, and write buffer are used interchangeably to refer to the same thing as what Intel calls the store buffer. Although the term write buffer does have a specific meaning in other contexts. If you are reading a document that uses any of these terms without defining them, you'll have to figure out from the context how they are used. In that particular paper you cited, the two terms have been defined precisely. Anyway, I understand that it's easy to get confused because I've been there.