Cache friendly matrix shift function

2019-07-01 15:51发布

I want to shift the first row of a 2D square matrix to the last row. So if I have a matrix like A, I want to get B.

visual of process

I can do this using two simple for loops. E.g.

void shift(int M, int N, int A[M][N]){
    int i, j,temp;
    for (i = 1; i < M; i++){
        for (j = 0; j < N; j++){
            temp=A[i][j];
            A[i][j]=A[i-1][j];
            A[i-1][j]=temp;
        }
    }
}

But I want to get as few cache misses as possible. Any tip on how to do that?

2条回答
戒情不戒烟
2楼-- · 2019-07-01 16:17

I just saw your comment about 4x4 matrices. A 4x4 array of int fits in a single cache line (on modern x86 CPUs, where a cache line is 64B). In that case, you want the compiler to generate something like

## matrix address in [rdi]
movups    xmm0, [rdi]
movups    xmm1, [rdi+16]
movups    xmm2, [rdi+32]
movups    xmm3, [rdi+48]
movups    [rdi],    xmm1     ; doing all the stores after all the loads avoids any possible false dependency
movups    [rdi+16], xmm2
movups    [rdi+32], xmm3
movups    [rdi+48], xmm0

Or maybe fewer AVX 256b loads/stores, but unaligned AVX might do worse. If the array is 64B-aligned, then none of the required loads/stores would cross cache line boundaries. So 2x vmovups ymm loads, and one vmovups ymm store, one vmovups xmm store (to the end), and one vextractf128 store (to the start).

If you're lucky, John's memcpy will optimize away to something like this when the function inlines into a caller that has compile-time-constant values of 4.

For tiny arrays, it's not cache-misses that's the issue, just how to get the whole copy to happen with as little overhead as possible. My ideas below about introducing a level of indirection are not a good idea, because it's really cheap to load all the data and store it back out.


For large matrices:

If you leave room at the end of your matrix for another row, you can just copy the first row to this extra space, and pass a pointer to what was the 2nd row.

This lets you temporarily have a different view of a matrix, but it's not a repeatable process.

If you had a big buffer, you could go on rotating the matrix rows this way until you get to the end of the reserved space and have to copy the array back to the top of the buffer. This minimizes copying overhead but does mean that you're touching some new memory.


If row-copying overhead is a big deal, introducing a level of indirection might be a good idea. Depending on the access pattern of the code that uses it after you shuffle the rows, this might be worse. Instead of a normal 2D array, this might be a use case for an array of pointers to rows.

You can and should allocate the storage for the matrix with one big allocation, instead of allocating each row separately. A C++ std::vector of vectors is not ideal. Initializing your int *rows[M] just takes a loop of &A[i][0], so it's just math, not multiple loads or allocations.

Accessing the array through this indirection table replaces i*N + j math with pointer-chasing: load rows[i], then index that with j.

When you don't need the shuffled view of the array, you can access it directly, but if you want to be able to make permanent shuffles to the array, all users of it always have to go through the indirection layer.

查看更多
相关推荐>>
3楼-- · 2019-07-01 16:19
/* M is the number of rows; N is the number of columns. */
void matrix_shift(int M, int N, int A[M][N]) {
    size_t rowbytes = N * sizeof(int);
    int temprow[N];
    memcpy(temprow, A, rowbytes); // store first row
    memmove(A, A + 1, (M-1) * rowbytes); // shift up
    memcpy(A + (M-1), temprow, rowbytes); // replace last row
}

This keeps it simple and relies on routines which should be highly optimized on any common platform. There is one extra row copied, but this is a minor inefficiency in the stated case of a square matrix.

查看更多
登录 后发表回答