Determine NUMA layout via latency/performance meas

Recently I have been observing performance effects in memory-intensive workloads I was unable to explain. Trying to get to the bottom of this I started running several microbenchmarks in order to determine common performance parameters like cache line size and L1/L2/L3 cache size (I knew them already, I just wanted to see if my measurements reflected the actual values).

For the cache line test my code roughly looks as follows (Linux C, but the concept is similar on Windows etc., of course):

char *array = malloc (ARRAY_SIZE);
int count = ARRAY_SIZE / STEP;
clock_gettime(CLOCK_REALTIME, &start_time);

for (int i = 0; i < ARRAY_SIZE; i += STEP) {
  array[i]++;
}
clock_gettime(CLOCK_REALTIME, &end_time);

// calculate time per element here:
[..]

Varying STEP from 1 to 128 shows that from STEP=64 on, the time per element does not increase further, i.e. every iteration has to fetch a new cache line, which dominates the runtime. Varying ARRAY_SIZE from 1K to 16384K while keeping STEP=64, I was able to create a nice plot exhibiting a step pattern that roughly corresponds to L1, L2 and L3 latency. To get reliable numbers I had to repeat the for loop a number of times, though, for very small array sizes even 100,000s of times. On my Ivy Bridge notebook I can then clearly see L1 ending at 64K, L2 at 256K and even the L3 at 6M.
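The stride measurement described above could be packaged roughly as follows. This is only a sketch under a few assumptions that are not in the original snippet: the function name ns_per_element and the repeats parameter are made up for illustration, CLOCK_MONOTONIC replaces CLOCK_REALTIME so clock adjustments cannot disturb the interval, and the buffer is touched once before timing so page faults are not counted.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/*
 * Touch one byte every `step` bytes of an `array_size`-byte buffer,
 * `repeats` times over, and return the average time per touched
 * element in nanoseconds (or -1.0 if the allocation fails).
 * ns_per_element is a hypothetical helper name.
 */
double ns_per_element(size_t array_size, size_t step, size_t repeats)
{
    assert(array_size > 0 && step > 0 && repeats > 0);

    char *buf = malloc(array_size);
    if (buf == NULL)
        return -1.0;
    memset(buf, 0, array_size);   /* fault in every page before timing */

    /* volatile keeps the compiler from removing or merging the accesses */
    volatile char *array = buf;
    size_t per_pass = (array_size + step - 1) / step;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t r = 0; r < repeats; r++) {
        for (size_t i = 0; i < array_size; i += step)
            array[i]++;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    free(buf);
    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / ((double)per_pass * (double)repeats);
}
```

Calling, say, ns_per_element(size, 64, repeats) for sizes from 1K to 16384K and plotting the results should give the kind of step pattern described above, with more repeats needed for the smallest sizes.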

Now on to my real question: In a NUMA system, any single core may access remote main memory, and even shared cache, that is not necessarily as close as its local cache and memory. I was hoping to see a difference in latency/performance and thus determine how much memory I could allocate while staying within my fast caches/part of memory.

For this, I refined my test to walk through the memory in 1/10 MB chunks, measuring the latency of each chunk separately, and later collect the fastest chunks, roughly like this:

for (int chunk_start = 0; chunk_start < ARRAY_SIZE; chunk_start += CHUNK_SIZE) {
  int chunk_end = MIN (ARRAY_SIZE, chunk_start + CHUNK_SIZE);
  int chunk_els = CHUNK_SIZE / STEP;
  for (int i = chunk_start; i < chunk_end; i+= STEP) {
    array[i]++;
  }
  // calculate time per element
[..]

As soon as I increase ARRAY_SIZE to something larger than the L3 size, I get wildly unreliable numbers that not even a large number of repeats is able to even out. There is no way I can make out a pattern usable for performance evaluation with this, let alone determine where exactly a NUMA stripe starts, ends or is located.

Then I figured the hardware prefetcher is smart enough to recognize my simple access pattern and simply fetch the needed lines into the cache before I access them. Adding a random number to the array index increased the time per element but did not seem to help much otherwise, probably because I had a rand() call in every iteration. Precomputing some random values and storing them in an array did not seem a good idea to me, as this array would also be stored in a hot cache and skew my measurements. Increasing STEP to 4097 or 8193 did not help much either; the prefetcher must be smarter than me.
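One common way around the prefetcher, which is not what the code above does, is pointer chasing: the random order is stored inside the measured buffer itself, one index per cache line, so there is no separate hot array and no rand() call in the timed loop. The sketch below is an illustration only. The names build_chain and chase_ns_per_load are hypothetical, the 64-byte line size is an assumption, and the cycle is built with Sattolo's algorithm so that every line is visited exactly once per lap.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

#define LINE_BYTES 64                        /* assumed cache line size */
#define SLOT (LINE_BYTES / sizeof(size_t))   /* array elements per line */

static volatile size_t sink;  /* keeps the chase loop from being optimized away */

/*
 * Allocate `lines` cache lines and link them into one random cycle
 * (Sattolo's algorithm). The first element of each line holds the
 * index of the next line to visit. Returns NULL if allocation fails.
 */
size_t *build_chain(size_t lines)
{
    assert(lines > 1);
    size_t *chain = calloc(lines * SLOT, sizeof(size_t));
    if (chain == NULL)
        return NULL;

    for (size_t i = 0; i < lines; i++)
        chain[i * SLOT] = i * SLOT;

    /* Swapping only with j < i guarantees a single cycle over all lines. */
    for (size_t i = lines - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t tmp = chain[i * SLOT];
        chain[i * SLOT] = chain[j * SLOT];
        chain[j * SLOT] = tmp;
    }
    return chain;
}

/* Follow the chain for `loads` dependent loads; return average ns per load. */
double chase_ns_per_load(const size_t *chain, size_t loads)
{
    assert(chain != NULL && loads > 0);
    struct timespec t0, t1;
    size_t idx = 0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < loads; i++)
        idx = chain[idx];   /* the next address depends on this load */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    sink = idx;
    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / (double)loads;
}
```

Because each load address depends on the previous load, neither the prefetcher nor out-of-order execution can run ahead, so the result approximates latency rather than bandwidth. For buffers much larger than the caches, the figure will also include TLB misses.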

Is my approach sensible/viable or did I miss the larger picture? Is it possible to observe NUMA latencies like this at all? If yes, what am I doing wrong? I disabled address space randomization just to be sure and to preclude strange cache aliasing effects. Is there something else operating-system-wise that has to be tuned before measuring?

Is it possible to observe NUMA latencies like this at all? If yes, what am I doing wrong?

Memory allocators are NUMA aware, so by default you will not observe any NUMA effects until you explicitly ask for memory to be allocated on another node. The simplest way to achieve this is numactl(8): run your application on one node and bind its memory allocations to another, like so:

numactl --cpunodebind 0 --membind 1 ./my-benchmark
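To confirm from inside the benchmark that the CPU binding took effect, the process can ask the kernel which CPU and NUMA node it is currently running on. The sketch below uses the Linux getcpu system call through syscall(2). The helper names are hypothetical, and it only checks the CPU side, not where the memory ended up.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Store the CPU and NUMA node the calling thread is running on.
 * Returns 0 on success, -1 on error (errno is set by syscall()).
 * current_cpu_and_node is a hypothetical helper name.
 */
int current_cpu_and_node(unsigned *cpu, unsigned *node)
{
    assert(cpu != NULL && node != NULL);
    return (int)syscall(SYS_getcpu, cpu, node, NULL);
}

/* Print the placement once, e.g. at the start of the benchmark. */
void print_placement(void)
{
    unsigned cpu = 0, node = 0;
    if (current_cpu_and_node(&cpu, &node) == 0)
        printf("running on cpu %u, NUMA node %u\n", cpu, node);
    else
        perror("getcpu");
}
```

With the numactl invocation above, print_placement() would be expected to report node 0 while the allocations are served from node 1.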

Data Initialization

Fill the array with random numbers:

static void random_data_init()
{
    for (size_t i = 0; i < ARR_SZ; i++) {
        arr[i] = rand();
    }
}

Benchmark

Perform 1M operations per benchmark iteration to reduce measurement noise. Use the random number stored in the array to jump over a few cache lines. Note that the idx & (size - 1) mask only wraps cleanly when size is a power of two:

const size_t OPERATIONS = 1 * 1000 * 1000; // 1M operations per iteration

int random_step_sizeK(size_t size)
{
    size_t idx = 0;

    for (size_t i = 0; i < OPERATIONS; i++) {
        arr[idx & (size - 1)]++;
        idx += arr[idx & (size - 1)] * 64; // assuming cache line is 64B
    }
    return 0;
}

Results

Here are the results for i5-4460 CPU @ 3.20GHz:

----------------------------------------------------------------
Benchmark                         Time           CPU Iterations
----------------------------------------------------------------
random_step_sizeK/4         4217004 ns    4216880 ns        166
random_step_sizeK/8         4146458 ns    4146227 ns        168
random_step_sizeK/16        4188168 ns    4187700 ns        168
random_step_sizeK/32        4180545 ns    4179946 ns        163
random_step_sizeK/64        5420788 ns    5420140 ns        129
random_step_sizeK/128       6187776 ns    6187337 ns        112
random_step_sizeK/256       7856840 ns    7856549 ns         89
random_step_sizeK/512      11311684 ns   11311258 ns         57
random_step_sizeK/1024     13634351 ns   13633856 ns         51
random_step_sizeK/2048     16922005 ns   16921141 ns         48
random_step_sizeK/4096     15263547 ns   15260469 ns         41
random_step_sizeK/6144     15262491 ns   15260913 ns         46
random_step_sizeK/8192     45484456 ns   45482016 ns         23
random_step_sizeK/16384    54070435 ns   54064053 ns         14
random_step_sizeK/32768    59277722 ns   59273523 ns         11
random_step_sizeK/65536    63676848 ns   63674236 ns         10
random_step_sizeK/131072   66383037 ns   66380687 ns         11

There are obvious steps between 32K/64K (so my L1 cache is ~32K), 256K/512K (so my L2 cache size is ~256K) and 6144K/8192K (so my L3 cache size is ~6M).
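Dividing each time by the 1M operations per iteration gives roughly 4.2 ns per operation while the working set fits in L1 and about 45 ns once it exceeds L3. The small sketch below reproduces that arithmetic from the "Time" column of the table and picks out the largest relative jump. The struct and function names are made up for illustration, and each operation is an increment plus a dependent load, so the figures are only a rough proxy for latency.

```c
#include <assert.h>
#include <stddef.h>

#define OPERATIONS 1000000.0  /* operations per benchmark iteration */

struct sample { size_t size_k; double iter_ns; };

/* Rows copied from the "Time" column of the results table above. */
static const struct sample samples[] = {
    {4, 4217004},      {8, 4146458},      {16, 4188168},
    {32, 4180545},     {64, 5420788},     {128, 6187776},
    {256, 7856840},    {512, 11311684},   {1024, 13634351},
    {2048, 16922005},  {4096, 15263547},  {6144, 15262491},
    {8192, 45484456},  {16384, 54070435}, {32768, 59277722},
    {65536, 63676848}, {131072, 66383037},
};
static const size_t NUM_SAMPLES = sizeof(samples) / sizeof(samples[0]);

/* Average time of one operation, in nanoseconds. */
double ns_per_op(double iter_ns)
{
    return iter_ns / OPERATIONS;
}

/* Index of the row with the largest slowdown relative to the row before it. */
size_t largest_jump(const struct sample *s, size_t n)
{
    assert(n >= 2);
    size_t best = 1;
    for (size_t i = 2; i < n; i++) {
        if (s[i].iter_ns / s[i - 1].iter_ns >
            s[best].iter_ns / s[best - 1].iter_ns)
            best = i;
    }
    return best;
}
```

On these numbers the largest jump is the roughly threefold slowdown from 6144K to 8192K, which matches the L3 boundary; the L1 and L2 steps are visible but considerably smaller.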