I have a few questions about the STREAM benchmark (http://www.cs.virginia.edu/stream/ref.html#runrules).
- Below is the relevant comment from stream.c. What is the rationale behind the requirement that each array should be 4 times the size of the cache?
* (a) Each array must be at least 4 times the size of the
* available cache memory. I don't worry about the difference
* between 10^6 and 2^20, so in practice the minimum array size
* is about 3.8 times the cache size.
- I originally assumed STREAM measures the peak memory bandwidth, but I later found that when I add extra arrays and array accesses, I can get larger bandwidth numbers. So it looks to me like STREAM doesn't guarantee to saturate memory bandwidth. My question, then, is: what does STREAM really measure, and how should the numbers it reports be used?
For example, I added two extra arrays, made sure to access them together with the original a/b/c arrays, and modified the byte accounting accordingly. With these two extra arrays, my bandwidth number goes up by ~11.5%.
> diff stream.c modified_stream.c
181c181,183
< c[STREAM_ARRAY_SIZE+OFFSET];
---
> c[STREAM_ARRAY_SIZE+OFFSET],
> e[STREAM_ARRAY_SIZE+OFFSET],
> d[STREAM_ARRAY_SIZE+OFFSET];
192,193c194,195
< 3 * sizeof(STREAM_TYPE) * STREAM_ARRAY_SIZE,
< 3 * sizeof(STREAM_TYPE) * STREAM_ARRAY_SIZE
---
> 5 * sizeof(STREAM_TYPE) * STREAM_ARRAY_SIZE,
> 5 * sizeof(STREAM_TYPE) * STREAM_ARRAY_SIZE
270a273,274
> d[j] = 3.0;
> e[j] = 3.0;
335c339
< c[j] = a[j]+b[j];
---
> c[j] = a[j]+b[j]+d[j]+e[j];
345c349
< a[j] = b[j]+scalar*c[j];
---
> a[j] = b[j]+scalar*c[j] + d[j]+e[j];
I compiled with:
CFLAGS = -O2 -fopenmp -D_OPENMP -DSTREAM_ARRAY_SIZE=50000000
My last-level cache is around 35 MB, and this is on a Skylake Linux server.
Any comments?
Thanks!
Memory accesses in modern computers are a lot more complex than one might expect, and it is very hard to tell when the "high-level" model falls apart because of some "low-level" detail that you did not know about before....
The STREAM benchmark code only measures execution time -- everything else is derived. The derived numbers are based on both decisions about what I think is "reasonable" and assumptions about how the majority of computers work. The run rules are the product of trial and error -- attempting to balance portability with generality.
The STREAM benchmark reports "bandwidth" values for each of the kernels. These are simple calculations based on the assumption that each array element on the right hand side of each loop has to be read from memory and each array element on the left hand side of each loop has to be written to memory. Then the "bandwidth" is simply the total amount of data moved divided by the execution time.
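For concreteness, here is a minimal sketch of that derivation, modeled on the accounting in stream.c (the variable names here are mine; the kernel time is measured separately around each loop):

    /* Sketch of STREAM's derived-bandwidth accounting (illustrative names).
       Copy (c[j] = a[j]) counts 1 read + 1 write per iteration;
       Add (c[j] = a[j]+b[j]) and Triad (a[j] = b[j]+scalar*c[j])
       count 2 reads + 1 write. */
    double copy_bytes = 2.0 * sizeof(STREAM_TYPE) * STREAM_ARRAY_SIZE;
    double add_bytes  = 3.0 * sizeof(STREAM_TYPE) * STREAM_ARRAY_SIZE;
    /* MB/s as stream.c reports it: 1.0E-06 * bytes / seconds */
    double copy_mbps  = 1.0E-06 * copy_bytes / copy_time;

Nothing in this calculation asks the hardware how many bytes actually moved; the totals are purely what the source code implies.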
There are a surprising number of assumptions involved in this simple calculation.
Additional notes on avoiding "write allocate" traffic:
The key point here, as pointed out in Dr. Bandwidth's answer, is that STREAM only counts the useful bandwidth seen by the source code. (He is the author of the benchmark.)
In practice, the write stream incurs read bandwidth as well, due to RFO (Read For Ownership) requests: when a CPU wants to write 16 bytes (for example) of a cache line, it first has to load the original cache line into L1d cache and then modify it there.
(Unless your compiler auto-vectorizes with NT stores that bypass the cache and avoid that RFO. Some compilers will do that for loops they expect to write an array too large for cache before any of it is re-read.)
See Enhanced REP MOVSB for memcpy for more about cache-bypassing stores that avoid an RFO.
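To make the NT-store idea concrete, here is a hedged sketch of a cache-bypassing copy loop using AVX intrinsics (the function name copy_nt is mine, not part of STREAM; it assumes dst is 32-byte aligned, which _mm256_stream_pd requires, and it needs -mavx or similar to compile):

    #include <immintrin.h>
    #include <stddef.h>

    /* Illustrative copy with non-temporal stores: the stores go to memory
       through write-combining buffers, so no RFO read of the destination
       cache lines is generated. */
    void copy_nt(double *restrict dst, const double *restrict src, size_t n)
    {
        size_t j = 0;
        for (; j + 4 <= n; j += 4) {
            __m256d v = _mm256_loadu_pd(src + j);
            _mm256_stream_pd(dst + j, v);   /* NT store; dst + j must be 32-byte aligned */
        }
        for (; j < n; j++)                  /* scalar tail */
            dst[j] = src[j];
        _mm_sfence();                       /* order NT stores before later stores */
    }

This is roughly the transformation a compiler performs when it auto-vectorizes a large copy loop with streaming stores.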
So increasing the number of read streams relative to write streams brings the software-observed bandwidth closer to the actual hardware bandwidth. (Also, a mixed read/write workload may not be perfectly efficient for the memory.)
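As a rough back-of-the-envelope check of the ~11.5% bump in the question, assume every write costs an RFO read plus a writeback and nothing hits in cache:

    original Add:  read a,b     + RFO c + write c -> 4 hardware streams, 3 counted: 3/4 = 75% of hardware BW reported
    modified Add:  read a,b,d,e + RFO c + write c -> 6 hardware streams, 5 counted: 5/6 ~ 83% reported
    ratio: (5/6) / (3/4) = 10/9 ~ +11%

The modified Triad has the same 5-counted/6-actual ratio, so this crude model predicts an increase close to the observed ~11.5%.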
The purpose of the STREAM benchmark is not to measure the peak memory bandwidth (i.e., the maximum memory bandwidth that can be achieved on the system), but to measure the "memory bandwidth" of a number of kernels (COPY, SCALE, ADD, and TRIAD) that are important to the HPC community. So when the bandwidth reported by STREAM is higher, HPC applications will probably run faster on the system.
It's also important to understand the meaning of the term "memory bandwidth" in the context of the STREAM benchmark, which is explained in the last section of the documentation. As mentioned in that section, there are at least three ways to count the number of bytes moved by a benchmark. The STREAM benchmark uses the STREAM method, which counts the bytes read and written at the source-code level. For example, in the ADD kernel (a(i) = b(i) + c(i)), two elements are read and one element is written, so, assuming all accesses go to memory, the number of bytes accessed per loop iteration equals the number of arrays (three) multiplied by the element size (8 bytes for double precision). STREAM calculates bandwidth by multiplying the total number of elements accessed (counted using the STREAM method) by the element size and dividing by the kernel's execution time. To account for run-to-run variation, each kernel is run multiple times; the reported bandwidth comes from the fastest run, alongside the average, minimum, and maximum times.
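For instance, with the question's STREAM_ARRAY_SIZE=50000000 and 8-byte doubles, the unmodified ADD kernel is counted as 3 * 8 B * 50,000,000 = 1.2 GB of traffic per timed iteration, and that total divided by the measured loop time is the figure STREAM prints.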
As you can see, the bandwidth reported by STREAM is not the real memory bandwidth (at the hardware level), so it doesn't even make sense to call it the peak bandwidth; it's almost always much lower than the peak. For example, this article shows how ECC and 2 MB pages impact the bandwidth reported by STREAM. Writing a benchmark that actually achieves the maximum possible memory bandwidth (at the hardware level) on modern Intel processors is a major challenge, and may be a good problem for a whole Ph.D. thesis. In practice, though, the peak bandwidth is less important than the STREAM bandwidth in the HPC domain. (Related: see my answer for information on the issues involved in measuring memory bandwidth at the hardware level.)
Regarding your first question: note that STREAM just assumes that all reads and writes are satisfied by main memory and not by any cache. Allocating arrays that are much larger than the LLC helps make it more likely that this is the case; essentially, the complex and undocumented aspects of the LLC, such as the replacement policy and the placement policy, need to be defeated. The arrays don't have to be exactly 4x larger than the LLC; my understanding is that 4x is what Dr. Bandwidth found to work in practice.
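As a quick check against the numbers in the question: with STREAM_ARRAY_SIZE=50000000, each array of 8-byte doubles is 400 MB, which on its own is more than 11 times the ~35 MB LLC (and well above the rule's 4 x 35 MB = 140 MB threshold), so that run comfortably defeats the cache.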