I read an article from about 1.5 years ago (http://www.drdobbs.com/parallel/cache-friendly-code-solving-manycores-ne/240012736) which talks about cache performance and the size of data. The authors show the following code, which they say they ran on an i7 (Sandy Bridge):
static volatile int array[Size];

static void test_function(void)
{
    for (int i = 0; i < Iterations; i++)
        for (int x = 0; x < Size; x++)
            array[x]++;
}
They claim that if Size*Iterations is kept constant, then increasing Size until the in-memory size of array grows beyond the L2 cache size produces a huge (10x) spike in execution time.
As an exercise for myself I wanted to try this to see if I could reproduce their results on my machine (i7 3770k, Win7, Visual C++ 2012 compiler, Win32 debug mode, no optimizations enabled). To my surprise, though, I am not able to see an increase in execution time (even beyond the L3 cache size), which made me think the compiler was somehow optimizing this code away. But I don't see any such optimization in the disassembly either. The only change in speed I see is that below the word size of my machine it takes slightly longer. Below are my timings, code listing, and pertinent disassembly.
Does anyone know:
1) Why does the time taken not increase at all, regardless of the size of my array? Or how could I find out?
2) Why does the time taken start high and then decrease until the cache line size is reached? Shouldn't more iterations be processed without reading from cache if the data is smaller than a cache line?
Timings (from GetTickCount, in milliseconds):
Size=1,Iterations=1073741824, Time=3829
Size=2,Iterations=536870912, Time=2625
Size=4,Iterations=268435456, Time=2563
Size=16,Iterations=67108864, Time=2906
Size=32,Iterations=33554432, Time=3469
Size=64,Iterations=16777216, Time=3250
Size=256,Iterations=4194304, Time=3140
Size=1024,Iterations=1048576, Time=3110
Size=2048,Iterations=524288, Time=3187
Size=4096,Iterations=262144, Time=3078
Size=8192,Iterations=131072, Time=3125
Size=16384,Iterations=65536, Time=3109
Size=32768,Iterations=32768, Time=3078
Size=65536,Iterations=16384, Time=3078
Size=262144,Iterations=4096, Time=3172
Size=524288,Iterations=2048, Time=3109
Size=1048576,Iterations=1024, Time=3094
Size=2097152,Iterations=512, Time=3313
Size=4194304,Iterations=256, Time=3391
Size=8388608,Iterations=128, Time=3312
Size=33554432,Iterations=32, Time=3109
Size=134217728,Iterations=8, Time=3515
Size=536870912,Iterations=2, Time=3532
Code:
#include <cstdio>
#include <cassert>
#include <windows.h>

template <unsigned int SIZE, unsigned int ITERATIONS>
static void test_body(volatile char* array)
{
    for (unsigned int i = 0; i < ITERATIONS; i++)
    {
        for (unsigned int x = 0; x < SIZE; x++)
        {
            array[x]++;
        }
    }
}

template <unsigned int SIZE, unsigned int ITERATIONS>
static void test_function()
{
    assert(SIZE*ITERATIONS == 1024*1024*1024);
    static volatile char array[SIZE];
    test_body<SIZE, 1>(array); // warmup
    DWORD beginTime = GetTickCount();
    test_body<SIZE, ITERATIONS>(array);
    DWORD endTime = GetTickCount();
    printf("Size=%u,Iterations=%u, Time=%u\n", SIZE, ITERATIONS, endTime - beginTime);
}
int main()
{
    enum { eIterations = 1024*1024*1024 };
    test_function<1, eIterations>();
    test_function<2, eIterations/2>();
    test_function<4, eIterations/4>();
    test_function<16, eIterations/16>();
    test_function<32, eIterations/32>();
    test_function<64, eIterations/64>();
    test_function<256, eIterations/256>();
    test_function<1024, eIterations/1024>();
    test_function<2048, eIterations/2048>();
    test_function<4096, eIterations/4096>();
    test_function<8192, eIterations/8192>();
    test_function<16384, eIterations/16384>();
    test_function<32768, eIterations/32768>();
    test_function<65536, eIterations/65536>();
    test_function<262144, eIterations/262144>();
    test_function<524288, eIterations/524288>();
    test_function<1048576, eIterations/1048576>();
    test_function<2097152, eIterations/2097152>();
    test_function<4194304, eIterations/4194304>();
    test_function<8388608, eIterations/8388608>();
    test_function<33554432, eIterations/33554432>();
    test_function<134217728, eIterations/134217728>();
    test_function<536870912, eIterations/536870912>();
}
Disassembly:
for (unsigned int i = 0; i < ITERATIONS; i++)
00281A59 mov dword ptr [ebp-4],0
00281A60 jmp test_body<536870912,2>+1Bh (0281A6Bh)
00281A62 mov eax,dword ptr [ebp-4]
00281A65 add eax,1
00281A68 mov dword ptr [ebp-4],eax
00281A6B cmp dword ptr [ebp-4],2
00281A6F jae test_body<536870912,2>+53h (0281AA3h)
{
for (unsigned int x = 0; x < SIZE; x++)
00281A71 mov dword ptr [ebp-8],0
00281A78 jmp test_body<536870912,2>+33h (0281A83h)
00281A7A mov eax,dword ptr [ebp-8]
{
for (unsigned int x = 0; x < SIZE; x++)
00281A7D add eax,1
00281A80 mov dword ptr [ebp-8],eax
00281A83 cmp dword ptr [ebp-8],20000000h
00281A8A jae test_body<536870912,2>+51h (0281AA1h)
{
array[x]++;
00281A8C mov eax,dword ptr [array]
00281A8F add eax,dword ptr [ebp-8]
00281A92 mov cl,byte ptr [eax]
00281A94 add cl,1
00281A97 mov edx,dword ptr [array]
00281A9A add edx,dword ptr [ebp-8]
00281A9D mov byte ptr [edx],cl
}
00281A9F jmp test_body<536870912,2>+2Ah (0281A7Ah)
}
00281AA1 jmp test_body<536870912,2>+12h (0281A62h)
I don't get constant time. I modified your code a bit to make it simpler, and my times are much lower than yours; I'm not sure why. The large times at the beginning make sense: with only a few values to write to, the increments form a dependency chain. The L2 cache ends at 256 KB / 4 bytes per int = 64K ints. Notice how the values start rising between Size=32768 and Size=65536.
The code:
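A sketch of what such a simplified variant might look like, assuming 32-bit ints, a run-time loop over sizes instead of templates, and clock() for timing (all details are illustrative, not the original listing):

// Illustrative reconstruction: same benchmark shape, Size*Iterations
// held constant at 2^30, array of 32-bit ints instead of chars.
#include <cstdio>
#include <ctime>

static const unsigned int Total = 1024u * 1024u * 1024u;

static void test(volatile int* array, unsigned int size)
{
    unsigned int iterations = Total / size;
    clock_t begin = clock();
    for (unsigned int i = 0; i < iterations; i++)
        for (unsigned int x = 0; x < size; x++)
            array[x]++;
    clock_t end = clock();
    printf("Size=%u, Iterations=%u, Time=%lld ms\n", size, iterations,
           (long long)(end - begin) * 1000 / CLOCKS_PER_SEC);
}

int main()
{
    const unsigned int maxSize = 64u * 1024u * 1024u; // 64M ints = 256 MB
    volatile int* array = new int[maxSize]();         // zero-initialized
    for (unsigned int size = 1; size <= maxSize; size *= 2)
        test(array, size);
    delete[] const_cast<int*>(array);
}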
TL;DR: Your test is not a correct test of cache latency or speed. Instead it measures some problems of chopping complex code through the OoO CPU pipeline.

Use the right tests for measuring cache and memory latency: lat_mem_rd from lmbench. And use the right tests for speed (bandwidth) measurements: the STREAM benchmark for memory speed, and the tests from memtest86 for cache speed (these use rep movsl as their main operation).

Also, in modern (2010 and newer) desktop/server CPUs there is hardware prefetch logic built in near the L1 and L2 caches which will detect linear access patterns and preload data from outer caches into inner ones before you ask for the data: Intel Optimization Manual - 7.2 Hardware prefetching of data, page 365; intel.com blog, 2009. It is hard to disable all hardware prefetchers (SO Q/A 1, SO Q/A 2).
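This is why latency tests like lat_mem_rd chase dependent pointers in a randomized order: each load address depends on the previous load's result, so neither the out-of-order engine nor the linear prefetcher can run ahead of the chain. A minimal sketch of the idea (an illustration, not lmbench's actual code):

#include <cstdio>
#include <cstdlib>
#include <ctime>

// Sattolo's algorithm builds a single random cycle over the array, so the
// chase visits every slot and the next address is unknown until the
// current load completes -- linear hardware prefetchers cannot help.
static void chase(size_t count)
{
    size_t* next = new size_t[count];
    for (size_t i = 0; i < count; i++)
        next[i] = i;
    for (size_t i = count - 1; i > 0; i--)   // Sattolo's shuffle: one cycle
    {
        size_t j = (size_t)rand() % i;
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }
    const size_t loads = 100000000;          // 100M dependent loads
    size_t p = 0;
    clock_t begin = clock();
    for (size_t i = 0; i < loads; i++)
        p = next[p];                         // serialized load-to-load chain
    clock_t end = clock();
    double ns = (double)(end - begin) / CLOCKS_PER_SEC * 1e9 / (double)loads;
    printf("%10zu bytes: %6.2f ns per load (check %zu)\n",
           count * sizeof(size_t), ns, p);   // printing p keeps the loop alive
    delete[] next;
}

int main()
{
    for (size_t n = 1024; n <= 16 * 1024 * 1024; n *= 4)
        chase(n);                            // 8 KB ... 128 MB working sets
}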
Long story:
I will try to do several measurements of a similar test with the perf performance monitoring tool in Linux (a.k.a. perf_events). The code is based on the program from Joky (an array of 32-bit ints, not of chars) and was separated into several binaries: a5 is for size 2^5 = 32; a10 => 2^10 = 1024 (4 KB); a15 => 2^15 = 32768 (128 KB); a20 => 2^20 (1 million ints = 4 MB); and a25 => 2^25 (32 million ints = 128 MB). The CPU is an i7-2600 quad-core Sandy Bridge at 3.4 GHz.

Let's start with a basic perf stat run with the default event set (some lines are skipped). I selected 2^10 (4 KB) and 2^20 (4 MB).

What can we see here? The instruction counts are very close (because Size*Iterations is constant), and the cycle counts and times are close too. Both examples have 1 billion branches with 99% good prediction. But there is a very high 60% stall count for the frontend and 5-8% for the backend. Frontend stalls are stalls in instruction fetch and decode. It can be hard to tell why, but for your code the frontend can't decode 4 instructions per tick (page B-41 of the Intel optimisation manual, section B.3 - "Performance tuning techniques for ... Sandy Bridge", subsection B.3.2 "Hierarchical Top-Down Performance Characterization ...").
Or look here with raw opcodes (objdump -d): some instructions have rather complicated indexing, so possibly they can't be handled by the 3 simple decoders and have to wait for the only complex decoder (image here: http://www.realworldtech.com/sandy-bridge/4/). Backend stalls are stalls created by waiting for memory or cache (the thing you are interested in when measuring caches) and by internal execution core stalls.

Most backend stalls are reported for add 0x1,%edx because it is the consumer of the data loaded from the array by the previous instruction. Together with the store to the array, these account for 70% of backend stalls, or, if we multiply that by the total backend stall portion of the program (7%), for 5% of all stalls. In other words, the cache is faster than your program. Now we can answer your first question:

Your test is so bad (not optimized) that, although you are trying to measure the caches, they cause only a 5% slowdown on total run time. And your test is so unstable (noisy) that you will not see this 5% effect.
With custom perf stat runs we can also measure the cache request-to-miss ratio. For the 4 KB program, the L1 data cache serves 99.99% of all loads and 99.999% of all stores. Note that your incorrect test generates several times more requests to the cache than are needed to walk the array and increment every element (1 billion loads + 1 billion stores). The additional accesses are for working with local variables like x; they are always served by the cache because their addresses are constant (see the sketch below for what an optimized build would do instead).

For the 4 MB program the hit rate is many, many times worse: 100 times more misses! Now 1.2% of all memory requests are served not by L1 but by L2.
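As an aside on those extra requests: with optimizations enabled the compiler keeps the counters and the pointer in registers, and the inner loop effectively reduces to something like this sketch (my illustration, not a dump of actual compiler output), where the only memory traffic left is the one load and one store per element that volatile still requires:

// Sketch: what the optimizer effectively reduces the inner loop to.
// The pointer lives in a register, so no stack traffic remains; volatile
// still forces exactly one load and one store per element.
static void inner_loop(volatile char* array, unsigned int size)
{
    for (volatile char* p = array; p != array + size; ++p)
        *p = *p + 1;
}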
Is this really a setup in which we can notice cache latency going from 4 CPU ticks up to 12 (3 times longer), when that change affects only 1.2% of cache requests, and when only 7% of our program's run time is sensitive to cache latencies at all?
What if we use a bigger data set? OK, here is a25 (2^25 4-byte ints = 128 MB, several times more than the cache size):
Almost the same L1 miss rate, and more backend stalls. I was able to get stats on the "cache-references,cache-misses" events, and I suggest they are about the L3 cache (there are several times more requests to L2):
So, the miss rate is high, but the test does 1 billion (useful) loads, and only 0.08 billion of them miss L1; 0.01 billion requests are served by memory. Memory latency is around 230 CPU clocks instead of the 4-clock L1 latency (0.01 billion misses x 230 clocks is roughly 2.3 billion potential stall cycles, if none of them overlap). Is the test able to see this? Maybe, if the noise is low.
It seems clear that constant time implies a constant instruction execution rate. To measure cache/RAM speed, data transfer instructions should predominate, and the results need more qualification than raw run time, such as MB/second and instructions per second. You need something like my BusSpeed benchmark (Google for Roy BusSpeed benchmark or BusSpd2k for source code and results, with versions for Windows, Linux and Android). The original used assembly code with instructions like:
Later versions used C; a sketch of the general shape of those reading loops is shown after the notes below.
The benchmark measures MB/second for caches and RAM, including skipped sequential addressing, to see where data is read in bursts. Example results follow. Note the burst reading effects, and that reading into two different registers (Reg2, from the assembly code version) can be faster than into one. In this case, loading every word into one register (AndI, Reg1, Inc4 bytes) produces almost constant speeds (around 1400 MIPS). So even a long sequence of instructions might not suit particular pipelines. The way to find out is to run a wider variation of your tests.
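As an illustration of the general shape of such reading loops (hypothetical names like memarray and memoryWords, not the actual BusSpd2k source): many words are ANDed per pass, so every word must be loaded while the arithmetic stays trivial, and the result is used afterwards so the loop cannot be optimized away.

#include <cstdio>

// Illustrative AND-based read loop (hypothetical names, not BusSpd2k code).
// Reading 8 words per pass keeps loop overhead low; ANDing into one
// accumulator forces every load; printing the result defeats dead-code
// elimination.
static int memarray[1024 * 1024];            // 4 MB working set
static const int memoryWords = 1024 * 1024;

int main()
{
    int andsum = ~0;
    for (int i = 0; i < memoryWords; i += 8)
    {
        andsum = andsum & memarray[i]     & memarray[i + 1]
                        & memarray[i + 2] & memarray[i + 3]
                        & memarray[i + 4] & memarray[i + 5]
                        & memarray[i + 6] & memarray[i + 7];
    }
    printf("check %d\n", andsum);            // time this loop with your timer
    return 0;
}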
[Example results: Intel(R) Core(TM) i7 CPU 930 @ 2.80GHz, measured 2807 MHz. MM uses 1 and 8 MMX registers; later versions use SSE.]
Source code and execution files are free for anyone to play with. The files are in the following locations, where the array declarations can be seen:
Windows http://www.roylongbottom.org.uk/busspd2k.zip
Linux http://www.roylongbottom.org.uk/memory_benchmarks.tar.gz
Results and other links (MP version, Android) are in:
http://www.roylongbottom.org.uk/busspd2k%20results.htm
Some results (OS X, Sandy Bridge):

[plots of time vs. working-set size: GCC -O0 and GCC -O3]

(Note how the first one is really slower; I feel like there may be a mis-speculation somewhere around load-store forwarding...)
Interestingly, turning the optimizations on and removing the volatile shows a somewhat nicer curve:
To help anyone reproduce the "issue", here is some standard (I hope) C++ code:
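A sketch of such a portable version, assuming std::chrono for timing (only a few of the sizes are instantiated here; extend main as in the question):

#include <chrono>
#include <cstdio>

template <unsigned int SIZE, unsigned int ITERATIONS>
static void test_body(volatile char* array)
{
    for (unsigned int i = 0; i < ITERATIONS; i++)
        for (unsigned int x = 0; x < SIZE; x++)
            array[x]++;                    // volatile: one load + one store
}

template <unsigned int SIZE, unsigned int ITERATIONS>
static void test_function()
{
    static_assert(SIZE * ITERATIONS == 1024u * 1024u * 1024u,
                  "keep total work constant");
    static volatile char array[SIZE];
    test_body<SIZE, 1>(array);             // warmup
    auto begin = std::chrono::steady_clock::now();
    test_body<SIZE, ITERATIONS>(array);
    auto end = std::chrono::steady_clock::now();
    long long ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - begin).count();
    printf("Size=%u, Iterations=%u, Time=%lld ms\n", SIZE, ITERATIONS, ms);
}

int main()
{
    enum { eIterations = 1024 * 1024 * 1024 };
    test_function<1024, eIterations / 1024>();
    test_function<1024 * 1024, eIterations / (1024 * 1024)>();
    test_function<64 * 1024 * 1024, eIterations / (64 * 1024 * 1024)>();
    // ...extend with the other sizes from the question as desired
}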