L1 memory bandwidth: 50% drop in efficiency using

2019-01-22 15:47发布

站内文章 / C

31 0

做自己的国王

女 | 书童

私信

可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

I want to achieve the maximum bandwidth of the following operations with Intel processors.

for(int i=0; i<n; i++) z[i] = x[i] + y[i]; //n=2048

where x, y, and z are float arrays. I am doing this on Haswell, Ivy Bridge , and Westmere systems.

I originally allocated the memory like this

char *a = (char*)_mm_malloc(sizeof(float)*n, 64);
char *b = (char*)_mm_malloc(sizeof(float)*n, 64);
char *c = (char*)_mm_malloc(sizeof(float)*n, 64);
float *x = (float*)a; float *y = (float*)b; float *z = (float*)c;

When I did this I got about 50% of the peak bandwidth I expected for each system.

The peak values is calculated as frequency * average bytes/clock_cycle. The average bytes/clock cycle for each system is:

Core2: two 16 byte reads one 16 byte write per 2 clock cycles     -> 24 bytes/clock cycle
SB/IB: two 32 byte reads and one 32 byte write per 2 clock cycles -> 48 bytes/clock cycle
Haswell: two 32 byte reads and one 32 byte write per clock cycle  -> 96 bytes/clock cycle

This means that e.g. on Haswell I I only observe 48 bytes/clock cycle (could be two reads in one clock cycle and one write the next clock cycle).

I printed out the difference in the address of b-a and c-b and each are 8256 bytes. The value 8256 is 8192+64. So they are each larger than the array size (8192 bytes) by one cache-line.

On a whim I tried allocating the memory like this.

const int k = 0;
char *mem = (char*)_mm_malloc(1<<18,4096);
char *a = mem;
char *b = a+n*sizeof(float)+k*64;
char *c = b+n*sizeof(float)+k*64;
float *x = (float*)a; float *y = (float*)b; float *z = (float*)c;

This nearly doubled my peak bandwidth so that I now get around 90% of the peak bandwidth. However, when I tried k=1 it dropped back to 50%. I have tried other values of k and found that e.g. k=2, k=33, k=65 only gets 50% of the peak but e.g. k=10, k=32, k=63 gave the full speed. I don't understand this.

In Agner Fog's micrarchitecture manual he says that there is a false dependency with memory address with the same set and offset

It is not possible to read and write simultaneously from addresses that are spaced by a multiple of 4 Kbytes.

But that's exactly where I see the biggest benefit! When k=0 the memory address differ by exactly 2*4096 bytes. Agner also talks about Cache bank conflicts. But Haswell and Westmere are not suppose to have these bank conflicts so that should not explain what I am observing. What's going on!?

I understand that the OoO execution decides which address to read and write so even if the arrays' memory addresses differ by exactly 4096 bytes that does not necessarily mean the processor reads e.g. &x[0] and writes &z[0] at the same time but then why would being off by a single cache line cause it to choke?

Edit: Based on Evgeny Kluev's answer I now believe this is what Agner Fog calls a "bogus store forwarding stall". In his manual under the Pentium Pro, II and II he writes:

Interestingly, you can get a get a bogus store forwarding stall when writing and reading completely different addresses if they happen to have the same set-value in different cache banks:

; Example 5.28. Bogus store-to-load forwarding stall
mov byte ptr [esi], al
mov ebx, dword ptr [esi+4092]
; No stall
mov ecx, dword ptr [esi+4096]
; Bogus stall

Edit: Here is table of the efficiencies on each system for k=0 and k=1.

               k=0      k=1        
Westmere:      99%      66%
Ivy Bridge:    98%      44%
Haswell:       90%      49%

I think I can explain these numbers if I assume that for k=1 that writes and reads cannot happen in the same clock cycle.

       cycle     Westmere          Ivy Bridge           Haswell
           1     read  16          read  16 read  16    read  32 read 32
           2     write 16          read  16 read  16    write 32
           3                       write 16
           4                       write 16  

k=1/k=0 peak    16/24=66%          24/48=50%            48/96=50%

This theory works out pretty well. Ivy bridge is a bit lower than I would expect but Ivy Bridge suffers from bank cache conflicts where the others don't so that may be another effect to consider.

Below is working code to test this yourself. On a system without AVX compile with g++ -O3 sum.cpp otherwise compile with g++ -O3 -mavx sum.cpp. Try varying the value k.

//sum.cpp
#include <x86intrin.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#define TIMER_TYPE CLOCK_REALTIME

double time_diff(timespec start, timespec end)
{
    timespec temp;
    if ((end.tv_nsec-start.tv_nsec)<0) {
        temp.tv_sec = end.tv_sec-start.tv_sec-1;
        temp.tv_nsec = 1000000000+end.tv_nsec-start.tv_nsec;
    } else {
        temp.tv_sec = end.tv_sec-start.tv_sec;
        temp.tv_nsec = end.tv_nsec-start.tv_nsec;
    }
    return (double)temp.tv_sec +  (double)temp.tv_nsec*1E-9;
}

void sum(float * __restrict x, float * __restrict y, float * __restrict z, const int n) {
    #if defined(__GNUC__)
    x = (float*)__builtin_assume_aligned (x, 64);
    y = (float*)__builtin_assume_aligned (y, 64);
    z = (float*)__builtin_assume_aligned (z, 64);
    #endif
    for(int i=0; i<n; i++) {
        z[i] = x[i] + y[i];
    }
}

#if (defined(__AVX__))
void sum_avx(float *x, float *y, float *z, const int n) {
    float *x1 = x;
    float *y1 = y;
    float *z1 = z;
    for(int i=0; i<n/64; i++) { //unroll eight times
        _mm256_store_ps(z1+64*i+  0,_mm256_add_ps(_mm256_load_ps(x1+64*i+ 0), _mm256_load_ps(y1+64*i+  0)));
        _mm256_store_ps(z1+64*i+  8,_mm256_add_ps(_mm256_load_ps(x1+64*i+ 8), _mm256_load_ps(y1+64*i+  8)));
        _mm256_store_ps(z1+64*i+ 16,_mm256_add_ps(_mm256_load_ps(x1+64*i+16), _mm256_load_ps(y1+64*i+ 16)));
        _mm256_store_ps(z1+64*i+ 24,_mm256_add_ps(_mm256_load_ps(x1+64*i+24), _mm256_load_ps(y1+64*i+ 24)));
        _mm256_store_ps(z1+64*i+ 32,_mm256_add_ps(_mm256_load_ps(x1+64*i+32), _mm256_load_ps(y1+64*i+ 32)));
        _mm256_store_ps(z1+64*i+ 40,_mm256_add_ps(_mm256_load_ps(x1+64*i+40), _mm256_load_ps(y1+64*i+ 40)));
        _mm256_store_ps(z1+64*i+ 48,_mm256_add_ps(_mm256_load_ps(x1+64*i+48), _mm256_load_ps(y1+64*i+ 48)));
        _mm256_store_ps(z1+64*i+ 56,_mm256_add_ps(_mm256_load_ps(x1+64*i+56), _mm256_load_ps(y1+64*i+ 56)));
    }
}
#else
void sum_sse(float *x, float *y, float *z, const int n) {
    float *x1 = x;
    float *y1 = y;
    float *z1 = z;
    for(int i=0; i<n/32; i++) { //unroll eight times
        _mm_store_ps(z1+32*i+  0,_mm_add_ps(_mm_load_ps(x1+32*i+ 0), _mm_load_ps(y1+32*i+  0)));
        _mm_store_ps(z1+32*i+  4,_mm_add_ps(_mm_load_ps(x1+32*i+ 4), _mm_load_ps(y1+32*i+  4)));
        _mm_store_ps(z1+32*i+  8,_mm_add_ps(_mm_load_ps(x1+32*i+ 8), _mm_load_ps(y1+32*i+  8)));
        _mm_store_ps(z1+32*i+ 12,_mm_add_ps(_mm_load_ps(x1+32*i+12), _mm_load_ps(y1+32*i+ 12)));
        _mm_store_ps(z1+32*i+ 16,_mm_add_ps(_mm_load_ps(x1+32*i+16), _mm_load_ps(y1+32*i+ 16)));
        _mm_store_ps(z1+32*i+ 20,_mm_add_ps(_mm_load_ps(x1+32*i+20), _mm_load_ps(y1+32*i+ 20)));
        _mm_store_ps(z1+32*i+ 24,_mm_add_ps(_mm_load_ps(x1+32*i+24), _mm_load_ps(y1+32*i+ 24)));
        _mm_store_ps(z1+32*i+ 28,_mm_add_ps(_mm_load_ps(x1+32*i+28), _mm_load_ps(y1+32*i+ 28)));
    }
}
#endif

int main () {
    const int n = 2048;
    const int k = 0;
    float *z2 = (float*)_mm_malloc(sizeof(float)*n, 64);

    char *mem = (char*)_mm_malloc(1<<18,4096);
    char *a = mem;
    char *b = a+n*sizeof(float)+k*64;
    char *c = b+n*sizeof(float)+k*64;

    float *x = (float*)a;
    float *y = (float*)b;
    float *z = (float*)c;
    printf("x %p, y %p, z %p, y-x %d, z-y %d\n", a, b, c, b-a, c-b);

    for(int i=0; i<n; i++) {
        x[i] = (1.0f*i+1.0f);
        y[i] = (1.0f*i+1.0f);
        z[i] = 0;
    }
    int repeat = 1000000;
    timespec time1, time2;

    sum(x,y,z,n);
    #if (defined(__AVX__))
    sum_avx(x,y,z2,n);
    #else
    sum_sse(x,y,z2,n);
    #endif
    printf("error: %d\n", memcmp(z,z2,sizeof(float)*n));

    while(1) {
        clock_gettime(TIMER_TYPE, &time1);
        #if (defined(__AVX__))
        for(int r=0; r<repeat; r++) sum_avx(x,y,z,n);
        #else
        for(int r=0; r<repeat; r++) sum_sse(x,y,z,n);
        #endif
        clock_gettime(TIMER_TYPE, &time2);

        double dtime = time_diff(time1,time2);
        double peak = 1.3*96; //haswell @1.3GHz
        //double peak = 3.6*48; //Ivy Bridge @ 3.6Ghz
        //double peak = 2.4*24; // Westmere @ 2.4GHz
        double rate = 3.0*1E-9*sizeof(float)*n*repeat/dtime;
        printf("dtime %f, %f GB/s, peak, %f, efficiency %f%%\n", dtime, rate, peak, 100*rate/peak);
    }
}

回答1:

I think the gap between a and b does not really matter. After leaving only one gap between b and c I've got the following results on Haswell:

Since Haswell is known to be free of bank conflicts, the only remaining explanation is false dependence between memory addresses (and you've found proper place in Agner Fog's microarchitecture manual explaining exactly this problem). The difference between bank conflict and false sharing is that bank conflict prevents accessing the same bank twice during the same clock cycle while false sharing prevents reading from some offset in 4K piece of memory just after you've written something to same offset (and not only during the same clock cycle but also for several clock cycles after the write).

Since your code (for k=0) writes to any offset just after doing two reads from the same offset and would not read from it for a very long time, this case should be considered as "best", so I placed k=0 at the end of the table. For k=1 you always read from offset that is very recently overwritten, which means false sharing and therefore performance degradation. With larger k time between write and read increases and CPU core has more chances to pass written data through all memory hierarchy (which means two address translations for read and write, updating cache data and tags and getting data from cache, data synchronization between cores, and probably many more stuff). k=12 or 24 clocks (on my CPU) is enough for every written piece of data to be ready for subsequent read operations, so starting with this value performance gets back to usual. Looks not very different from 20+ clocks on AMD (as said by @Mysticial).

回答2:

First, observe that the three arrays have a total size of 24KB. In addition, since you're initializing the arrays before executing the main loop, most accesses in the main loop will hit into the L1D, which is 32KB in size and 8-way associative on modern Intel processors. So we don't have to worry about misses or hardware prefetching. The most important performance event in this case is LD_BLOCKS_PARTIAL.ADDRESS_ALIAS, which occurs when a partial address comparison involving a later load results in a match with an earlier store and all of the conditions of store forwarding are satisfied, but the target locations are actually different. Intel refers to this situation as 4K aliasing or false store forwarding. The observable performance penalty of 4K aliasing depends on the surrounding code.

By measuring cycles, LD_BLOCKS_PARTIAL.ADDRESS_ALIAS and MEM_UOPS_RETIRED.ALL_LOADS, we can see that for all values of k where the achieved bandwidth is much smaller than the peak bandwidth, LD_BLOCKS_PARTIAL.ADDRESS_ALIAS and MEM_UOPS_RETIRED.ALL_LOADS are almost equal. Also for all values of k where the achieved bandwidth is close to the peak bandwidth, LD_BLOCKS_PARTIAL.ADDRESS_ALIAS is very small compared to MEM_UOPS_RETIRED.ALL_LOADS. This confirms that bandwidth degradation is occurring due to most loads suffering from 4K aliasing.

The Intel optimization manual Section 12.8 says the following:

4-KByte memory aliasing occurs when the code stores to one memory location and shortly after that it loads from a different memory location with a 4-KByte offset between them. For example, a load to linear address 0x400020 follows a store to linear address 0x401020.

The load and store have the same value for bits 5 - 11 of their addresses and the accessed byte offsets should have partial or complete overlap.

That is, there are two conditions for a later load to alias with an earlier store:

Bits 5-11 of the two linear addresses must be equal.
The accessed locations must overlap (so that there can be some data to forward).

Only then 4K aliasing may occur.

However, this appears to not be accurate. The following listing shows the AVX-based (32-byte) sequence of memory accesses and the least significant 12 bits of their addresses for different values of k.

======
k=0
======
load x+(0*64+0)*4  = x+0 where x is 4k aligned    0000 0000 0000
load y+(0*64+0)*4  = y+0 where y is 4k aligned    0000 0000 0000
store z+(0*64+0)*4 = z+0 where z is 4k aligned    0000 0000 0000
load x+(0*64+8)*4  = x+32 where x is 4k aligned   0000 0010 0000
load y+(0*64+8)*4  = y+32 where y is 4k aligned   0000 0010 0000
store z+(0*64+8)*4 = z+32 where z is 4k aligned   0000 0010 0000
load x+(0*64+16)*4 = x+64 where x is 4k aligned   0000 0100 0000
load y+(0*64+16)*4 = y+64 where y is 4k aligned   0000 0100 0000
store z+(0*64+16)*4= z+64 where z is 4k aligned   0000 0100 0000
load x+(0*64+24)*4  = x+96 where x is 4k aligned  0000 0110 0000
load y+(0*64+24)*4  = y+96 where y is 4k aligned  0000 0110 0000
store z+(0*64+24)*4 = z+96 where z is 4k aligned  0000 0110 0000
load x+(0*64+32)*4 = x+128 where x is 4k aligned  0000 1000 0000
load y+(0*64+32)*4 = y+128 where y is 4k aligned  0000 1000 0000
store z+(0*64+32)*4= z+128 where z is 4k aligned  0000 1000 0000
.
.
.
======
k=1
======
load x+(0*64+0)*4  = x+0 where x is 4k aligned       0000 0000 0000
load y+(0*64+0)*4  = y+0 where y is 4k+64 aligned    0000 0100 0000
store z+(0*64+0)*4 = z+0 where z is 4k+64 aligned    0000 0100 0000
load x+(0*64+8)*4  = x+32 where x is 4k aligned      0000 0010 0000
load y+(0*64+8)*4  = y+32 where y is 4k+64 aligned   0000 0110 0000
store z+(0*64+8)*4 = z+32 where z is 4k+64 aligned   0000 0110 0000
load x+(0*64+16)*4 = x+64 where x is 4k aligned      0000 0100 0000
load y+(0*64+16)*4 = y+64 where y is 4k+64 aligned   0000 1000 0000
store z+(0*64+16)*4= z+64 where z is 4k+64 aligned   0000 1000 0000
load x+(0*64+24)*4  = x+96 where x is 4k aligned     0000 0110 0000
load y+(0*64+24)*4  = y+96 where y is 4k+64 aligned  0000 1010 0000
store z+(0*64+24)*4 = z+96 where z is 4k+64 aligned  0000 1010 0000
load x+(0*64+32)*4 = x+128 where x is 4k aligned     0000 1000 0000
load y+(0*64+32)*4 = y+128 where y is 4k+64 aligned  0000 1100 0000
store z+(0*64+32)*4= z+128 where z is 4k+64 aligned  0000 1100 0000
.
.
.

======
k=10
======
load x+(0*64+0)*4  = x+0 where x is 4k aligned        0000 0000 0000
load y+(0*64+0)*4  = y+0 where y is 4k+640 aligned    0000 1000 0000
store z+(0*64+0)*4 = z+0 where z is 4k+640 aligned    0010 1000 0000
load x+(0*64+8)*4  = x+32 where x is 4k aligned       0000 0010 0000
load y+(0*64+8)*4  = y+32 where y is 4k+640 aligned   0010 1010 0000
store z+(0*64+8)*4 = z+32 where z is 4k+640 aligned   0010 1010 0000
load x+(0*64+16)*4 = x+64 where x is 4k aligned       0000 0100 0000
load y+(0*64+16)*4 = y+64 where y is 4k+640 aligned   0010 1100 0000
store z+(0*64+16)*4= z+64 where z is 4k+640 aligned   0010 1100 0000
load x+(0*64+24)*4  = x+96 where x is 4k aligned      0000 0110 0000
load y+(0*64+24)*4  = y+96 where y is 4k+640 aligned  0010 1110 0000
store z+(0*64+24)*4 = z+96 where z is 4k+640 aligned  0010 1110 0000
load x+(0*64+32)*4 = x+128 where x is 4k aligned      0000 1000 0000
load y+(0*64+32)*4 = y+128 where y is 4k+640 aligned  0011 0000 0000
store z+(0*64+32)*4= z+128 where z is 4k+640 aligned  0011 0000 0000
.
.
.

On my Haswell processor, the efficiencies for different values of k are the following:

k  %
-----
0  86
1  48
2  48
3  48
4  48
5  48
6  58
7  65
8  68
9  75
10 86
11 86
12 86
.  (same)
.
.
64 86
65 48 (repeat the same sequence)
.
.
.

where 86% is the highest and 48% is the lowest.

Examine bits 5-11 of the accessed locations for the three different cases (different values of k). When k is 0 or 10 (or larger I think), there is no load whose bits 5-11 of its address match that of any recent previous store. This violates the first condition of 4K aliasing. The results are consistent with the manual. However, when k is between 0 and 10, it's difficult to see what's happening. We need a simulator.

I wrote a program that basically emulates the addresses of the memory accesses and calculates the total number of loads that suffered 4K aliasing for different values of k. One issue I faced was that we don't know, for any given load, the number of stores that are still in the store buffer (have not been committed yet). Therefore, to perform a comprehensive analysis, I tried with different store buffer occupancies. I assume that at any point in time, there are exactly a fixed number of stores in the store buffer. So the current load needs to be compared against all of these stores.

#include <iostream>
#include <list>
#include <algorithm>

int main() {
    std::list<int> addresses;
    int aliasCount = 0, k, i, j, occupancy;

    // "occupancy" represents the number of stores that can be compared against
    // to perform (false) store forwarding. This is unknown, so try a wide range of values.
    for (occupancy = 1; occupancy < 32; ++occupancy)
    {
        for (k = 0; k < 64; ++k)
        {

            addresses.clear();

            // Assume that at any point in time, there
            // are exactly "occupancy" many stores in the store buffer,
            // which are the most recent 4 stores.
            // 0x00f is a dummy value that will never match any other.
            for (int temp = 0; temp < occupancy; ++temp)
            addresses.push_back(0x00f);

            for (i = 0; i < 32; ++i)
            {
                for (j = 0; j < 8; ++j)
                {
                    // We only need to compare bits 5-11.
                    // 0xfe0 = 1111 1110 0000
                    int xaddr = 0xfe0 & ((64 * i + 8 * j) * 4);
                    int yaddr = 0xfe0 & (((64 * i + 8 * j) * 4) + (64 * k));
                    int zaddr = 0xfe0 & (((64 * i + 8 * j) * 4) + (64 * k));

                    if (std::find(addresses.cbegin(), addresses.cend(), xaddr) !=
                        addresses.end())
                        ++aliasCount;
                    if (std::find(addresses.cbegin(), addresses.cend(), yaddr) !=
                        addresses.end())
                        ++aliasCount;
                    addresses.push_back(zaddr); // Add the new store to the store buffer.
                    addresses.pop_front(); // Remove the olderst store from the buffer.
                }
            }

            std::cout << occupancy << ", " << k << ", " << aliasCount << "\n";
            aliasCount = 0;
        }
    }

    return 0;
}

You can find the raw results here. You can then create an interactive 3D plot of the results by simply copying them into the data box here and click on the 'Draw Graph" button.

By using the LD_BLOCKS_PARTIAL.ADDRESS_ALIAS performance counter, we have determined that the bandwidth is sub-optimal for k larger than 0 and smaller than 10. Hence, we need to find the smallest store buffer occupancy that achieves this behavior. From the results, it turns out that it is 18. That is, the most recent 18 stores need to be in the store buffer so that when k=10 or k=0, the L1D bandwidth is optimal and sub-optimal otherwise.

occupancy, k, aliasCount
18, 0, 0
18, 1, 254
18, 2, 252
18, 3, 250
18, 4, 248
18, 5, 246
18, 6, 244
18, 7, 242
18, 8, 240
18, 9, 238
18, 10, 0
18, 11, 0

Note that the total number of 32-byte loads is 512.

The change in aliasCount is consistent with the change in LD_BLOCKS_PARTIAL.ADDRESS_ALIAS, but it changes much slower. That is, aliasCount decreases until it reaches the optimal value when k = 10, but there is a steep drop from k = 9.

@BeeOnRope suggested to:

simulate a store throughput that is slightly less than 1 store per cycle. This enables us to remove the occupancy parameter.
if one of a pair of loads aliased a previous store due to a match in bits 5-11, then consider the load to be an alias as well. This could happen due to a flaw in ADDRESS_ALIAS rather than due to actual 4K aliasing.

By implementing both (the code is here), the results are as follows:

k, aliasCount
0, 0
1, 490
2, 446
3, 402
4, 358
5, 314
6, 270
7, 226
8, 182
9, 138
10, 94
11, 50
12, 6
13, 0

Much better! Although there seems that there is still room for improvement in the simulator.

The following graph plots ADDRESS_ALIAS / MEM_LOADS vs. k in three cases:

Both X and Y: _mm256_store_ps(z1+64*i+ 0,_mm256_add_ps(_mm256_load_ps(x1+64*i+ .), _mm256_load_ps(y1+64*i+ .)));.
Only X: _mm256_store_ps(z1+64*i+ 0,_mm256_add_ps(_mm256_load_ps(x1+64*i+ .), _mm256_set_ps(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0)));. In this case, MEM_LOADS is halved.
Only Y: _mm256_store_ps(z1+64*i+ 0,_mm256_add_ps(_mm256_set_ps(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0), _mm256_load_ps(y1+64*i+ .)));. In this case, MEM_LOADS is halved.
Sim: measurements made using the simulator.

The weird thing here is that the loads from y should never alias any store, but ADDRESS_ALIAS shows otherwise.

I've improved the simulator so that it can simulate different store throughputs for different values of k, which seems to reflect very well what's actually happening. An explanation of why the store throughputs might be so wildly different even though most of them hit in the L1 is still needed.

The following store throughputs were used:

k    th
-------
0    (doesn't matter)
1    0.83
2    0.83
3    0.86   
4    0.87
5    0.88
6    0.93
7    0.93
8    0.93
9    0.93
10   0.93 (this actually does matter)
11   0.93
12   0.93

For example, a value of 0.83 means that the probability that a store will leave the store buffer (be committed) in any given cycle is 83%.

I've conducted additional experiments to determine whether the two conditions for 4K aliasing to occur as described in the Intel manual are accurate and whether LD_BLOCKS_PARTIAL.ADDRESS_ALIAS works as expected. Consider the following loop body:

; Initially, bits 11-0 of `rsi` are zeroes and bits 11-0 of `r13` are zeroes except for the least significant bit.
add rsi, 64
add r13, 64
mov  r11, [rsi-56]
mov  [r13-56], r11
mov  r11, [rsi-48]
sfence ; To minimize standard deviation, it's better to execute the fences every two iterations.
lfence ; lfence is not ordered with sfence, but cpuid seems to have the same effect on the event counter in this particular case.

LD_BLOCKS_PARTIAL.ADDRESS_ALIAS in this case increases with the number of iterations, indicating that there is aliasing. To determine which of the two loads alias with a previous store, the same experiment can be repeat with one of the loads removed from the code. Without the first load, LD_BLOCKS_PARTIAL.ADDRESS_ALIAS still increase with the number of iterations. In addition, the number of events is a little less than the number of loads. Without the second load, however, LD_BLOCKS_PARTIAL.ADDRESS_ALIAS is fixed at some small value (with high standard deviation) no matter what the number of iterations is. This indicates that the second load is responsible for most or all of the aliasing events. Let's examine the addresses:

load 64-56=8=     0000 0000 1000
store 65-56=9=    0000 0000 1001
load 64-48=16=    0000 0001 0000

load 128-56=72=   0000 0100 1000
store 129-56=73=  0000 0100 1001
load 128-48=80=   0000 0101 0000

load 192-56=136=  0000 1000 1000
store 193-56=137= 0000 1000 1001
load 192-48=144=  0000 1001 0000

load 256-56=200=  0000 1100 1000
store 257-56=201= 0000 1100 1001
load 256-48=208=  0000 1101 0000

load 320-56=264=  0001 0000 1000
store 321-56=265= 0001 0000 1001
load 320-48=272=  0001 0001 0000

load 384-56=328=  0001 0100 1000
store 385-56=329= 0001 0100 1001
load 384-48=336=  0001 0101 0000

load 448-56=392=  0001 1000 1000
store 449-56=393= 0001 1000 1001
load 448-48=400=  0001 1001 0000

load 502-56=446=  0001 1100 1000
store 503-56=447= 0001 1100 1001
load 502-48=454=  0001 1101 0000

Remember that the size of each access is 8 bytes. Consider the first store-load-load group. The store targets the 8-byte range 0000 0000 1001 - 0000 0001 0000. The first load targets the 8-byte range 0000 0001 0000 - 0000 0001 0111. These two ranges overlap (bits 5-11 match and bits 0-4 overlap). The second load targets the 8-byte range 0000 0100 1000 - 0000 0100 1111. The range of this load does not overlap with the range of the previous store. In fact, it does not overlap with any previous store to the same page. This reasoning applies to all other values of bits 0-11 of r13 except when an access crosses a cache line or page boundary.

TBC.