I have an Intel Core Ivy Bridge processor, an Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (L1: 32KB, L2: 256KB, L3: 8MB). I know the L3 is inclusive and shared among the cores. I want to know the following about my system:
PART 1:
- Is L1 inclusive or exclusive?
- Is L2 inclusive or exclusive?
PART 2:
If L1 and L2 are both inclusive, then to find the access time of L2 we first declare an array (1MB) that is larger than the L2 cache (256KB) and access the whole array so that it is loaded into the L2 cache. After that, we access the array elements from the start index to the end index with a stride of 64B, since the cache line size is 64B. To get a more accurate result, we repeat this process (accessing the array elements from start to end) many times, say 1 million times, and take the average.
My understanding of why this approach gives the correct result is as follows. When we access an array larger than the L2 cache, the whole array is loaded from main memory into L3, then from L3 into L2, then from L2 into L1. The last 32KB of the array is in L1, because it was accessed most recently. The whole array is also present in the L2 and L3 caches, due to the inclusive property and cache coherency. Now, when I start accessing the array again from the starting index, that data is not in the L1 cache but is in the L2 cache, so there will be an L1 miss and the line will be loaded from L2. In this way every element of the array incurs the higher L2 access time, and summing them gives the total access time for the whole array. To get the time of a single access, I divide the total time by the total number of accesses.
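For concreteness, the measurement loop described above could be sketched roughly as follows. This is only a sketch under assumptions: the function name `avg_ns_per_access`, the `LINE_SIZE` constant, and the use of the standard `clock()` timer are illustrative choices, not something prescribed by the processor or the Intel manual. `clock()` is coarse, so the number of passes has to be large for the average to mean anything.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define LINE_SIZE 64u            /* assumed cache line size in bytes */

static volatile unsigned sink;   /* keeps the accumulated sum observable */

/* Touch one byte per cache line across `size` bytes, `passes` times,
 * and return the average time per access in nanoseconds.
 * The number of timed accesses is stored in *n_accesses if non-NULL. */
double avg_ns_per_access(unsigned char *buf, size_t size,
                         size_t passes, size_t *n_accesses)
{
    assert(buf != NULL && size >= LINE_SIZE && passes > 0);

    volatile unsigned char *p = buf;  /* volatile: every load is performed */
    unsigned sum = 0;
    size_t per_pass = (size + LINE_SIZE - 1) / LINE_SIZE;
    size_t count = passes * per_pass;

    /* Untimed warm-up pass: brings the array into the cache hierarchy. */
    for (size_t i = 0; i < size; i += LINE_SIZE)
        sum += p[i];

    clock_t start = clock();
    for (size_t pass = 0; pass < passes; pass++)
        for (size_t i = 0; i < size; i += LINE_SIZE)
            sum += p[i];
    clock_t end = clock();

    sink = sum;
    if (n_accesses != NULL)
        *n_accesses = count;

    /* Total elapsed CPU time divided by the number of timed accesses. */
    return (double)(end - start) / CLOCKS_PER_SEC * 1e9 / (double)count;
}
```

It would be called with a buffer obtained from `malloc(1 << 20)` (filled once with `memset` so the pages are actually mapped) and a large pass count. Note that the loads here are independent and sequential, so the result probably reflects throughput with hardware prefetching in play rather than the latency of a single L2 hit.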
My question is: am I correct?
Thanks in advance.
See section 2.2.5 of the Intel optimization guide:
http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf
(Note that this section describes Sandy Bridge, but nothing appears to have changed for Ivy Bridge, which has only minor microarchitectural changes over the previous generation.)
So regarding your questions:
Also note that if your benchmark accesses a data set larger than the L2, that data set will probably not stay resident in the L2 (especially if you access it serially and exceed the L2 by more than the size of a single way), and you would have to fetch it from the L3.
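To illustrate why a serial sweep over an oversized array does not stay in the L2, here is a small simulation sketch. It assumes a set-associative cache with true LRU replacement and no prefetching, which is a simplification (real hardware typically uses an approximation of LRU); the function name `sweep_hit_rate` and its parameters are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Simulate repeated sequential sweeps over `data_bytes` of memory through a
 * set-associative cache with true LRU replacement (an assumption).
 * One warm-up sweep is run first and not counted.
 * Returns the hit fraction over the counted sweeps, or -1.0 on failure. */
double sweep_hit_rate(size_t cache_bytes, size_t ways, size_t line_bytes,
                      size_t data_bytes, size_t sweeps)
{
    assert(ways > 0 && line_bytes > 0 && cache_bytes >= ways * line_bytes);

    size_t sets  = cache_bytes / (ways * line_bytes);
    size_t lines = data_bytes / line_bytes;
    uint64_t *tag   = malloc(sets * ways * sizeof *tag);
    uint64_t *stamp = calloc(sets * ways, sizeof *stamp); /* 0 = empty way */
    uint64_t now = 0, hits = 0, total = 0;

    if (tag == NULL || stamp == NULL) {
        free(tag);
        free(stamp);
        return -1.0;
    }

    for (size_t s = 0; s <= sweeps; s++) {        /* sweep 0 is the warm-up */
        for (size_t ln = 0; ln < lines; ln++) {
            uint64_t *t  = tag   + (ln % sets) * ways;
            uint64_t *st = stamp + (ln % sets) * ways;
            size_t victim = 0;
            int hit = 0;

            for (size_t w = 0; w < ways; w++) {
                if (st[w] != 0 && t[w] == ln) {   /* line already cached */
                    victim = w;
                    hit = 1;
                    break;
                }
                if (st[w] < st[victim])           /* track least recently used */
                    victim = w;
            }
            if (s > 0) {
                hits += (uint64_t)hit;
                total++;
            }
            t[victim]  = ln;                      /* fill or refresh the way */
            st[victim] = ++now;
        }
    }

    free(tag);
    free(stamp);
    return total ? (double)hits / (double)total : 0.0;
}
```

With the parameters from the question (256KB, 64B lines) and an assumed 8 ways, `sweep_hit_rate(256 * 1024, 8, 64, 1024 * 1024, 10)` returns 0.0 in this model, because each line is evicted before the sweep comes back to it, while a 128KB data set returns 1.0. Under these assumptions, every access to the 1MB array after the warm-up would be served from the L3 rather than the L2.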