I want to know if a program that I am using and which requires a lot of memory is limited by the memory bandwidth.
When do you expect this to happen? Did it ever happen to you in a real-life scenario?
I found several articles discussing this issue, including:
- http://www.cs.virginia.edu/~mccalpin/papers/bandwidth/node12.html
- http://www.cs.virginia.edu/~mccalpin/papers/bandwidth/node13.html
- http://ispass.org/ucas5/session2_3_ibm.pdf
The first link is a bit old, but suggests that you need to perform fewer than roughly 1-40 floating-point operations per floating-point variable in order to see this effect (correct me if I'm wrong).
How can I measure the memory bandwidth that a given program is using and how do I measure the (peak) bandwidth that my system can offer?
I don't want to discuss any complicated cache issues here. I'm only interested in the communication between the CPU and the memory.
To benchmark your system's memory performance try the STREAM benchmark. Study the benchmark tasks and the results you get carefully since they provide the basic data about your memory that you need to do anything further. You need to figure out the effect(s) of cache(s) -- you do have to understand them -- and when the bandwidth hits a peak.
To figure out the memory performance of your program:
- Measure the execution time for a range of problem sizes.
- Calculate, by hand, how much data your program reads and writes from and to memory for the same range of problem sizes.
- Divide memory use by time.
WARNING: this is a crude approach and should only be used to figure out whether you ought to pay attention to memory bandwidth issues. If your crude figuring tells you that your program uses less than 50% of the available memory bandwidth (the figures you got from the STREAM benchmark), then you shouldn't give it any more thought.
This crude approach works best when your program manipulates relatively few very large data structures with simple access patterns. This does describe a lot of high-performance scientific programs but perhaps not a lot of other types of program.
If your program is using virtual memory or if it is doing I/O as it executes, then memory bandwidth is not a problem, not until you sort out disk bandwidth that is.
Finally, yes, every time I run one of our scientific codes the speed of execution is limited by memory bandwidth. As a rule of thumb, if a code executes 10% of the FLOPS that the processor specification promises I'm happy.
Memory intensive applications or applications that require a lot of memory are restricted by:
- Speed of RAM outside the processor
- Speed of cache inside the processor
- Number of entities sharing the memory bus
- Virtual Memory
Unfortunately, these limitations are usually not the major players in a program's performance. Bigger effects come from the number of CPUs, I/O operations, and other tasks running alongside your program. Changing those will impact your program more than changing the items that affect memory bandwidth.
1. Speed of RAM outside the processor
The processor must go outside of its shell and grab instructions and data from RAM. RAM has different speeds at which it can access its cells and return the bits to the processor, generally marked in units of Hz. The faster the memory, the less time your process spends loading instructions and data, and the faster your program executes.
Note: Increasing the speed of the memory beyond the capabilities of the processor will not increase performance. It changes the bottleneck from the RAM to the processor. See also #3.
2. Speed of Cache inside the processor
Cache memory resides inside the shell of the processor. This is one of the fastest types of memory available. Processors will search this memory before searching RAM. Improving the speed and quantity of this memory will improve the performance of your processor, unless other cores are also accessing this memory. When multiple cores access the same cache, conflicts must be resolved, which may slow down your application's performance.
Note: There is nothing you can do to speed up or change the size of the cache memory except get another processor. The cache is not something that can be easily changed by human or robotic hands.
3. Number of entities sharing the memory bus
The memory bus is like a highway that entities use to get to the RAM. As with a highway, more lanes means faster throughput (e.g. 16-bit width vs. 32-bit). Many buses also have a speed limit, again the higher the limit, the faster the access. Probably the most notable concept is the number of entities connected to the bus. As with highways, more users slows down the traffic. In most memory buses, only one entity can use it at a time; other entities must wait. Reducing the number of entities that need to use the memory bus will speed up your program.
Some common entities sharing the memory bus: CPU, DMA controllers, Video processors, sound processors and network or I/O processors.
4. Virtual Memory.
Many modern computers employ virtual memory. If a program requires more memory than is available in RAM, the operating system will swap sections of memory out to areas on the hard drive. This costs far more time than any difference in raw memory speed. A memory-intensive program is more efficient when it works within the memory allocated to it rather than touching all the memory it could need. Reducing these virtual memory swaps will speed up a program.
In summary, there is a maximum speed at which your application can execute. Memory, both internal cache and external RAM, is a contributing factor to that upper limit. But there are bigger factors that prevent applications from reaching the limit, such as I/O operations and other concurrent tasks, and the design and implementation of the program itself can also contribute to the slowness. More performance can be gained by eliminating I/O operations and concurrent tasks and by redesigning the software than by raising the upper limit of memory access speed. Raising that limit will increase your program's performance, but not as drastically as the other techniques.
The broad and general scope of your question makes it nearly impossible to answer in any more than the broadest sense.
You can expect a program to be memory bound when the number of CPU cycles required to process one cache line of data is less than the number of CPU cycles required to read one cache line, and the data set processed is considerably larger than the CPU's data cache. Image processing is one example where this is often the case.
How can I measure the memory bandwidth that a given program is using and how do I measure the (peak) bandwidth that my system can offer?
The first can only be measured (in software) if the CPU supports some kind of performance counter that counts the number of cycles the CPU is stalled because it has to wait for a memory access to complete.
The second can be easily measured, typically by filling or copying large areas of memory. There are countless benchmark programs available which you can use (I haven't used one of those in years, but Sandra and PCMark come to mind; there should be plenty of freeware utilities that do this, too).
Programs that are limited by memory bandwidth have a high ratio of memory references (load and/or store operations) to arithmetic/logic operations. Examples are BLAS1 routines such as daxpy and ddot.
If the top routines in the code (from a flat profile) perform more arithmetic operations than loads/stores, then memory bandwidth does not impact you much. Examples are optimized matrix-matrix multiplication and LINPACK.