Can the Intel PMU be used to measure per-core read/write memory bandwidth usage? Here "memory" means DRAM (i.e., accesses that do not hit in any cache level).
I am not sure about the Intel PMU directly, but I think you can use Intel VTune Amplifier (https://software.intel.com/en-us/intel-vtune-amplifier-xe). It has a lot of tools for performance monitoring (memory, CPU cache, CPU). Maybe this will work for you.
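For example, a memory-access analysis can be collected from the command line roughly like this (a sketch only: ./your_app is a placeholder, and the driver binary is named amplxe-cl in older releases but vtune in newer oneAPI releases):

    # Older VTune Amplifier releases:
    amplxe-cl -collect memory-access -- ./your_app

    # Newer oneAPI VTune releases:
    vtune -collect memory-access -- ./your_app

Note that, as far as I know, the DRAM bandwidth numbers it reports come from uncore (socket-level) counters, so this does not directly answer the per-core part of the question.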
Yes, this is possible, although it is not necessarily as straightforward as programming the usual PMU counters.
One approach is to use the programmable memory controller counters, which are accessed via PCI space. A good place to start is by examining Intel's own implementation in pcm-memory at pcm-memory.cpp. This app shows you the per-socket or per-memory-controller throughput, which is suitable for some uses. In particular, the bandwidth is shared among all cores, so on a quiet machine you can assume most of the bandwidth is associated with the process under test, and if you want to monitor at the socket level it's exactly what you want.

The other alternative is careful programming of the "offcore response" counters. These, as far as I know, relate to traffic between the L2 (the last core-private cache) and the rest of the system. You can filter by the result of the offcore response, so you can use a combination of the various "L3 miss" events and multiply by the cache line size to get read and write bandwidth. The events are quite fine grained, so you can further break them down by what caused the access in the first place: instruction fetch, demand data requests, prefetching, etc.
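As a rough sketch of that approach with perf (assuming a perf version that exposes the offcore response events; the event name below is only an assumed Skylake-style example, so check perf list on your machine):

    # See which offcore response events your CPU/perf combination exposes:
    perf list | grep -i offcore_response

    # Assumed example: count, for 10 seconds, demand data reads from core 2
    # that missed the L3; count * 64 bytes / 10 s approximates that core's
    # demand read bandwidth from DRAM.
    perf stat -C 2 -e offcore_response.demand_data_rd.l3_miss.any_snoop -- sleep 10

Similar offcore response events exist for RFOs (stores) and prefetches if you want to cover write and prefetch traffic as well.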
The offcore response counters generally lag behind in support by tools like perf and likwid, but at least recent versions seem to have reasonable support, even for client parts like SKL.

Yes(ish), indirectly. You can use the relationship between counters (including the time stamp) to infer other numbers. For example, if you sample a 1-second interval and there are N last-level (L3) cache misses, you can be pretty confident you are occupying N*CacheLineSize bytes per second.
It gets a bit stickier to relate it accurately to program activity, as those misses might reflect CPU prefetching, interrupt activity, etc.

There is also a morass of "this CPU doesn't count (MMX, SSE, AVX, ...) unless this config bit is in this state"; thus rolling your own is cumbersome.
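A crude sketch of that inference using perf's generic cache events (which hardware events these map to, and whether they include prefetch traffic, depends on the CPU; not every CPU implements LLC-store-misses):

    # System-wide last-level cache misses over a 1-second window:
    perf stat -a -e LLC-load-misses,LLC-store-misses -- sleep 1

    # Approximate DRAM traffic:
    #   bytes/second ~ (LLC-load-misses + LLC-store-misses) * 64

Drop -a and add -C <n> or -p <pid> to narrow the scope, with the caveats above about prefetching and interrupts.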
The offcore response performance monitoring facility can be used to count all core-originated requests on the IDI from a particular core. The request type field can be used to count specific types of requests, such as demand data reads. However, to measure per-core memory bandwidth, the number of requests has to be somehow converted into bytes per second. Most requests are of the cache line size, i.e., 64 bytes. The size of other requests may not be known and could add to the memory bandwidth a number of bytes that is smaller or larger than the size of a cache line. These include cache line-split locked requests, WC requests, UC requests, I/O requests (but these don't contribute to memory bandwidth), and fence requests that require all pending writes to be completed (MFENCE, SFENCE, and serializing instructions).

If you are only interested in cacheable bandwidth, then you can count the number of cacheable requests and multiply that by 64 bytes. This can be very accurate, assuming that cacheable cache line-split locked requests are rare. Unfortunately, writebacks from the L3 (or L4 if available) to memory cannot be counted by the offcore response facility on any of the current microarchitectures. The reason for this is that these writebacks are not core-originated and usually occur as a consequence of a conflict miss in the L3. So the request that missed in the L3 and caused the writeback can be counted, but the offcore response facility does not enable you to determine whether any request to the L3 (or L4) has caused a writeback or not. That's why it's impossible to count writebacks to memory "per core."
In addition, offcore response events require a programmable performance counter that is one of 0, 1, 2, or 3 (but not 4-7 when hyperthreading is disabled).
Intel Xeon Broadwell supports a number of Resource Director Technology (RDT) features. In particular, it supports Memory Bandwidth Monitoring (MBM), which is the only way to measure memory bandwidth accurately per core in general.
MBM has three advantages over offcore response:
The advantage of offcore response is that it supports request type, supplier type, and snoop info fields.
Linux supports MBM starting with kernel version 4.6. On kernels 4.6 through 4.13, the MBM events are supported in perf; the event names are sketched below. The events can also be accessed programmatically.
Starting with 4.14, the implementation of RDT in Linux has significantly changed.
On my BDW-E5 (dual socket) system running kernel version 4.16, I can see the byte counts of MBM using a sequence of commands along the lines of the sketch below.
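A sketch of what that sequence looks like through the resctrl filesystem (the mount point is the documented default, and the mon_L3_* directory names depend on the machine's topology; treat this as illustrative rather than the exact commands):

    # Mount the resctrl filesystem (once, as root):
    mount -t resctrl resctrl /sys/fs/resctrl

    # MBM byte counts for the L3 domain of each socket (dual-socket example):
    cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
    cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
    cat /sys/fs/resctrl/mon_data/mon_L3_01/mbm_total_bytes
    cat /sys/fs/resctrl/mon_data/mon_L3_01/mbm_local_bytes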
My understanding is that the number of bytes is counted since system reset.
Note that by default, the resource being monitored is the whole socket.
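To narrow monitoring down from the whole socket, resctrl supports monitoring groups to which you can assign tasks or CPUs; the following is a hedged sketch (the group name core3 is arbitrary, and the file layout should be checked against your kernel's resctrl documentation):

    # Create a monitoring group and bind CPU 3 to it:
    mkdir /sys/fs/resctrl/mon_groups/core3
    echo 3 > /sys/fs/resctrl/mon_groups/core3/cpus_list

    # MBM byte counts attributed to that group on socket 0:
    cat /sys/fs/resctrl/mon_groups/core3/mon_data/mon_L3_00/mbm_total_bytes
    cat /sys/fs/resctrl/mon_groups/core3/mon_data/mon_L3_00/mbm_local_bytes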
Unfortunately, most RDT features, including MBM, turned out to be buggy on the Skylake processors that support them. According to errata SKZ4 and SKX4:

That is why MBM is disabled by default on Linux when running on Skylake. You can enable it by adding the parameter rdt=mbmtotal,mbmlocal to the kernel command line.

On the Intel Core 2 microarchitecture, memory bandwidth per core can be measured using the BUS_TRANS_MEM event, as discussed here.