I am trying to understand how lmbench measures latency for L1, L2 and main memory.
The man page for lat_mem_rd describes the method, but it's not clear to me:
The benchmark runs as two nested loops. The outer loop is the stride size. The inner loop is the array size. For each array size, the benchmark creates a ring of pointers that point forward one stride. Traversing the array is done by
p = (char **)*p;
in a for loop (the overhead of the for loop is not significant; the loop is an unrolled loop 1000 loads long). The loop stops after doing a million loads.
How do you "create a ring of pointers that point forward one stride"? Wouldn't this mean that if the stride size was 128 bytes, you would need to build a linked list with each node separated by exactly 128 bytes from its predecessor? malloc just returns a block at some arbitrary address, so I don't see how that's possible in C. And when I tried the `p = (char **)*p;` line on its own, I always got a segmentation fault (what is p supposed to be initialized with in the first place?).
There is a similar thread on SO (link), and the first answer discusses this, but it does not explain how the strided approach can be combined with a linked list. I also looked at the source itself (lat_mem_rd.c) but couldn't work it out from that either.
Any help is appreciated.