I've tried to measure the asymmetric memory access effects of NUMA, and failed.
The Experiment
Performed on an Intel Xeon X5570 @ 2.93 GHz, 2 CPUs, 8 cores.
On a thread pinned to core 0, I allocate an array x of 10,000,000 bytes on core 0's NUMA node with numa_alloc_local. Then I iterate over array x 50 times, reading and writing each byte, and measure the elapsed time for the 50 iterations.
Then, on each of the other cores in my server, I pin a new thread and again measure the elapsed time to do 50 iterations of reading and writing to every byte in array x.
Array x is large to minimize cache effects. We want to measure the speed when the CPU has to go all the way to RAM to load and store, not when caches are helping.
There are two NUMA nodes in my server, so I would expect the cores with affinity to the node where array x is allocated to show faster read/write speeds. I'm not seeing that.
Why?
Perhaps NUMA is only relevant on systems with > 8-12 cores, as I've seen suggested elsewhere?
http://lse.sourceforge.net/numa/faq/
numatest.cpp
#include <numa.h>
#include <iostream>
#include <boost/thread/thread.hpp>
#include <boost/date_time/posix_time/posix_time.hpp>
#include <pthread.h>
void pin_to_core(size_t core)
{
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core, &cpuset);
    pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
}
std::ostream& operator<<(std::ostream& os, const bitmask& bm)
{
    for (size_t i = 0; i < bm.size; ++i)
    {
        os << numa_bitmask_isbitset(&bm, i);
    }
    return os;
}
// Allocates N bytes on the local NUMA node, times M read/write passes over the
// buffer, and hands the buffer back through *x.
void* thread1(void** x, size_t core, size_t N, size_t M)
{
    pin_to_core(core);

    void* y = numa_alloc_local(N);

    boost::posix_time::ptime t1 = boost::posix_time::microsec_clock::universal_time();

    char c;
    for (size_t i(0); i < M; ++i)
        for (size_t j(0); j < N; ++j)
        {
            c = ((char*)y)[j];
            ((char*)y)[j] = c;
        }

    boost::posix_time::ptime t2 = boost::posix_time::microsec_clock::universal_time();

    std::cout << "Elapsed read/write by same thread that allocated on core " << core << ": " << (t2 - t1) << std::endl;

    *x = y;
    return NULL;
}
// Times M read/write passes over an existing buffer x from a thread pinned to the given core.
void thread2(void* x, size_t core, size_t N, size_t M)
{
    pin_to_core(core);

    boost::posix_time::ptime t1 = boost::posix_time::microsec_clock::universal_time();

    char c;
    for (size_t i(0); i < M; ++i)
        for (size_t j(0); j < N; ++j)
        {
            c = ((char*)x)[j];
            ((char*)x)[j] = c;
        }

    boost::posix_time::ptime t2 = boost::posix_time::microsec_clock::universal_time();

    std::cout << "Elapsed read/write by thread on core " << core << ": " << (t2 - t1) << std::endl;
}
int main(int argc, const char **argv)
{
    int numcpus = numa_num_task_cpus();
    std::cout << "numa_available() " << numa_available() << std::endl;
    numa_set_localalloc();

    // Report which cores belong to each NUMA node and how much memory the node has.
    bitmask* bm = numa_bitmask_alloc(numcpus);
    for (int i = 0; i <= numa_max_node(); ++i)
    {
        numa_node_to_cpus(i, bm);
        std::cout << "numa node " << i << " " << *bm << " " << numa_node_size(i, 0) << std::endl;
    }
    numa_bitmask_free(bm);

    void* x;
    size_t N(10000000);
    size_t M(50);

    // Allocate and time from a thread pinned to core 0 ...
    boost::thread t1(boost::bind(&thread1, &x, 0, N, M));
    t1.join();

    // ... then time the same passes from a thread pinned to each core in turn.
    for (int i = 0; i < numcpus; ++i)
    {
        boost::thread t2(boost::bind(&thread2, x, i, N, M));
        t2.join();
    }

    numa_free(x, N);

    return 0;
}
The Output
g++ -o numatest -pthread -lboost_thread -lnuma -O0 numatest.cpp
./numatest
numa_available() 0 <-- NUMA is available on this system
numa node 0 10101010 12884901888 <-- cores 0,2,4,6 are on NUMA node 0, which has about 12 GB
numa node 1 01010101 12874584064 <-- cores 1,3,5,7 are on NUMA node 1, which has slightly less memory than node 0
Elapsed read/write by same thread that allocated on core 0: 00:00:01.767428
Elapsed read/write by thread on core 0: 00:00:01.760554
Elapsed read/write by thread on core 1: 00:00:01.719686
Elapsed read/write by thread on core 2: 00:00:01.708830
Elapsed read/write by thread on core 3: 00:00:01.691560
Elapsed read/write by thread on core 4: 00:00:01.686912
Elapsed read/write by thread on core 5: 00:00:01.691917
Elapsed read/write by thread on core 6: 00:00:01.686509
Elapsed read/write by thread on core 7: 00:00:01.689928
Doing 50 iterations reading and writing over array x takes about 1.7 seconds, no matter which core is doing the reading and writing.
Update:
The cache size on my CPUs is 8 MB, so maybe a 10 MB array x is not big enough to eliminate cache effects. I tried a 100 MB array x, and I've tried issuing a full memory fence with __sync_synchronize() inside my innermost loops. It still doesn't reveal any asymmetry between NUMA nodes.
Update 2:
I've tried reading and writing to array x with __sync_fetch_and_add(). Still nothing.
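That attempt amounts to replacing the plain read/write with an atomic read-modify-write, roughly like this (a sketch; the exact form may have differed):

    for (size_t i(0); i < M; ++i)
        for (size_t j(0); j < N; ++j)
            __sync_fetch_and_add((char*)x + j, 1);   // atomic read-modify-write of each byte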
The first thing I want to point out is that you might want to double-check which cores are on each node. I don't recall cores and nodes being interleaved like that. Also, you should have 16 hardware threads due to Hyper-Threading, unless you disabled it.
Another thing:
The socket 1366 Xeon machines are only slightly NUMA, so it will be hard to see the difference. The NUMA effect is much more noticeable on the 4P Opterons.
On systems like yours, the node-to-node bandwidth is actually faster than the CPU-to-memory bandwidth. Since your access pattern is completely sequential, you are getting the full bandwidth regardless of whether or not the data is local. A better thing to measure is the latency: try randomly accessing a block of 1 GB instead of streaming it sequentially.
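One way to make such a test latency-bound (a sketch under assumed names, not code from the answer) is to chase pointers through a shuffled index array, so every load depends on the previous one; in the NUMA test the array would be allocated with numa_alloc_onnode or numa_alloc_local:

    #include <vector>
    #include <algorithm>

    // Build a random single-cycle permutation so the chase visits every slot exactly once.
    std::vector<size_t> build_chain(size_t n)
    {
        std::vector<size_t> order(n), next(n);
        for (size_t i = 0; i < n; ++i) order[i] = i;
        std::random_shuffle(order.begin() + 1, order.end());   // keep index 0 as the start
        for (size_t i = 0; i + 1 < n; ++i) next[order[i]] = order[i + 1];
        next[order[n - 1]] = order[0];                          // close the cycle
        return next;
    }

    // Each load address depends on the previous load, so the time is dominated by memory latency.
    size_t chase(const size_t* next, size_t steps)
    {
        size_t idx = 0;
        for (size_t i = 0; i < steps; ++i)
            idx = next[idx];
        return idx;   // return the result so the loop isn't optimized away
    }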
Last thing:
Depending on how aggressively your compiler optimizes, your loop might be optimized out since it doesn't do anything:
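This refers to the read/write loop from the question above:

    c = ((char*)x)[j];
    ((char*)x)[j] = c;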
Something like this will guarantee that it won't be eliminated by the compiler:
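A sketch of that idea (not necessarily the exact snippet) is to make the stored value depend on what has been read so far, so the store can no longer be proven redundant:

    c += ((char*)x)[j];   // accumulate instead of copying back the same value
    ((char*)x)[j] = c;    // the store now writes a changing value and cannot be dropped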
Thanks for this benchmark code. I've taken your 'fixed' version and changed it to pure C + OpenMP and added a few tests for how the memory system behaves under contention. You can find the new code here.
Here are some sample results from a Quad Opteron:
If someone has further improvements, I'd be happy to hear about them. For example, these are obviously not perfect bandwidth measurements in real-world units (they are likely off by an integer factor, hopefully a constant one).
Ah hah! Mysticial is right! Somehow, hardware pre-fetching is optimizing my read/writes.
If it were a cache optimization, then forcing a memory barrier would defeat the optimization:
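That version is essentially the original inner-loop body with a full fence added, along these lines:

    c = ((char*)x)[j];
    ((char*)x)[j] = c;
    __sync_synchronize();   // full memory barrier after every read/write pair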
but that doesn't make any difference. What does make a difference is multiplying my iterator index by prime 1009 to defeat the pre-fetching optimization:
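The modified inner-loop body is roughly (a sketch of the change described):

    ((char*)x)[(j * 1009) % N] += 1;   // 1009 is coprime to N, so every byte is still touched once per pass, just in a prefetch-unfriendly order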
With that change, the NUMA asymmetry is clearly revealed:
At least I think that's what's going on.
Thanks Mysticial!
EDIT: CONCLUSION (~133%)
For anyone who is just glancing at this post to get a rough idea of the performance characteristics of NUMA, here is the bottom line according to my tests:
Memory access to a non-local NUMA node has about 1.33 times the latency of memory access to a local node.
If anyone else wants to try this test, here is the modified, working program. I would love to see results from other hardware. This Works On My Machine with Linux 2.6.34-12-desktop, GCC 4.5.0, Boost 1.47.
numatest.cpp
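The key change relative to the original listing above is the prime-strided timing loop in thread1 and thread2; a sketch of that loop (the rest of the program is assumed to match the listing above):

    // Read/write pass modified to defeat hardware prefetching.
    for (size_t i(0); i < M; ++i)
        for (size_t j(0); j < N; ++j)
            ((char*)x)[(j * 1009) % N] += 1;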
A few comments:

- To check your hardware layout, use the lstopo utility from the hwloc library. In particular, you'll see which core numbers are members of which NUMA node (processor socket).
- char is probably not the ideal data type to measure the maximum RAM throughput. I suspect that with a 32-bit or 64-bit data type you can move more data through with the same number of CPU cycles (see the sketch after this list).
- More generally, you should also check that your measurement is not limited by the CPU speed but by the RAM speed. The ramspeed utility, for example, unrolls the inner loop explicitly to some extent in the source code. EDIT: on supported architectures, ramsmp actually even uses 'hand written' assembly code for these loops.
- L1/L2/L3 cache effects: it is instructive to measure the bandwidth in GByte/s as a function of the block size. You should see roughly four different speeds as the block size increases, corresponding to where you are reading the data from (caches or main memory). Your processor seems to have 8 MByte of Level 3 (?) cache, so your 10 million bytes might mostly stay in the L3 cache (which is shared among all cores of one processor).
- Memory channels: your processor has 3 memory channels. If your memory banks are installed such that you can exploit them all (see e.g. the motherboard's manual), you may want to run more than one thread at the same time. I saw effects where, reading with one thread only, the asymptotic bandwidth is close to that of a single memory module (e.g. 12.8 GByte/s for DDR-1600), while running multiple threads brings the asymptotic bandwidth close to the number of memory channels times the bandwidth of a single memory module.
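As a rough sketch of the wider-access point above (assumed code, not from the original comments), reading and writing the buffer as 64-bit words touches the same memory with about one eighth as many load/store instructions:

    #include <stdint.h>

    // One read/write pass over the buffer using 64-bit accesses.
    // Assumes N is a multiple of 8; at higher optimization levels, make the
    // store depend on the value read (as discussed above) so it isn't elided.
    void rw_pass_64(void* x, size_t N)
    {
        uint64_t* p = (uint64_t*)x;
        for (size_t j = 0; j < N / 8; ++j)
        {
            uint64_t v = p[j];
            p[j] = v;
        }
    }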
You can also use numactl to choose which node to run the process on and where to allocate memory from:
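For example (the binary name ./numatest is just a placeholder), to run on node 0's CPUs while allocating all memory on node 1:

    numactl --cpunodebind=0 --membind=1 ./numatest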
I use this combined with LMbench to get memory latency numbers:
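The invocations look roughly like this (a sketch, assuming LMbench's lat_mem_rd is built locally and using its randomized access mode):

    numactl --cpunodebind=0 --membind=0 ./lat_mem_rd -t 512   # latency to the local node
    numactl --cpunodebind=0 --membind=1 ./lat_mem_rd -t 512   # latency to the remote node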