Recently I have been observing performance effects in memory-intensive workloads I was unable to explain. Trying to get to the bottom of this I started running several microbenchmarks in order to determine common performance parameters like cache line size and L1/L2/L3 cache size (I knew them already, I just wanted to see if my measurements reflected the actual values).
For the cache line test my code roughly looks as follows (Linux C, but the concept is similiar to Windows etc. of course):
char *array = malloc (ARRAY_SIZE);
int count = ARRAY_SIZE / STEP;
clock_gettime(CLOCK_REALTIME, &start_time);
for (int i = 0; i < ARRAY_SIZE; i += STEP) {
array[i]++;
}
clock_gettime(CLOCK_REALTIME, &end_time);
// calculate time per element here:
[..]
Varying STEP
from 1 to 128 shows that from STEP=64
on, I saw that the time per element did not increase further, i.e. every iteration would need to fetch a new cache line dominating the runtime.
Varying ARRAY_SIZE
from 1K to 16384K keeping STEP=64
I was able to create a nice plot exhibiting a step pattern that roughly corresponds to L1, L2 and L3 latency. It was necessary to repeat the for loop a number of times, for very small array sizes even 100,000s of times, to get reliable numbers, though. Then, on my IvyBridge notebook I can clearly see L1 ending at 64K, L2 at 256K and even the L3 at 6M.
Now on to my real question: In a NUMA system, any single core will obtain remote main memory and even shared cache that is not necessarily as close as its local cache and memory. I was hoping to see a difference in latency/performance thus determining how much memory I could allocate while staying in my fast caches/part of memory.
For this, I refined my test to walk through the memory in 1/10 MB chunks measuring the latency separately and later collect the fastest chunks, roughly like this:
for (int chunk_start = 0; chunk_start < ARRAY_SIZE; chunk_start += CHUNK_SIZE) {
int chunk_end = MIN (ARRAY_SIZE, chunk_start + CHUNK_SIZE);
int chunk_els = CHUNK_SIZE / STEP;
for (int i = chunk_start; i < chunk_end; i+= STEP) {
array[i]++;
}
// calculate time per element
[..]
As soon as I start increasing ARRAY_SIZE
to something larger than the L3 size, I get wildy unrealiable numbers not even a large number of repeats is able to even out. There is no way I can make out a pattern usable for performance evaluation with this, let alone determine where exactly a NUMA stripe starts, ends or is located.
Then, I figured the Hardware prefetcher is smart enough to recognize my simple access pattern and simply fetch the needed lines into the cache before I access them. Adding a random number to the array index increases the time per element but did not seem to help much otherwise, probably because I had a rand ()
call every iteration. Precomputing some random values and storing them in an array did not seem a good idea to me as this array as well would be stored in a hot cache and skew my measurements. Increasing STEP
to 4097 or 8193 did not help much either, the prefetcher must be smarter than me.
Is my approach sensible/viable or did I miss the larger picture? Is it possible to observe NUMA latencies like this at all? If yes, what am I doing wrong? I disabled address space randomization just to be sure and preclude strange cache aliasing effects. Is there something else operating-sytem wise that has to be tuned before measuring?