Cache friendly matrix shift function

I want to shift the first row of a 2D square matrix to the last row. So if I have a matrix like A, I want to get B.

I can do this using two simple for loops. E.g.

void shift(int M, int N, int A[M][N]){
    int i, j,temp;
    for (i = 1; i < M; i++){
        for (j = 0; j < N; j++){
            temp=A[i][j];
            A[i][j]=A[i-1][j];
            A[i-1][j]=temp;
        }
    }
}

But I want to get as few cache misses as possible. Any tip on how to do that?

标签： c caching optimization matrix

2条回答

戒情不戒烟

2楼-- · 2019-07-01 16:17

I just saw your comment about 4x4 matrices. A 4x4 array of int fits in a single cache line (on modern x86 CPUs, where a cache line is 64B). In that case, you want the compiler to generate something like

## matrix address in [rdi]
movups    xmm0, [rdi]
movups    xmm1, [rdi+16]
movups    xmm2, [rdi+32]
movups    xmm3, [rdi+48]
movups    [rdi],    xmm1     ; doing all the stores after all the loads avoids any possible false dependency
movups    [rdi+16], xmm2
movups    [rdi+32], xmm3
movups    [rdi+48], xmm0

Or maybe fewer AVX 256b loads/stores, but unaligned AVX might do worse. If the array is 64B-aligned, then none of the required loads/stores would cross cache line boundaries. So 2x vmovups ymm loads, and one vmovups ymm store, one vmovups xmm store (to the end), and one vextractf128 store (to the start).

If you're lucky, John's memcpy will optimize away to something like this when the function inlines into a caller that has compile-time-constant values of 4.

For tiny arrays, it's not cache-misses that's the issue, just how to get the whole copy to happen with as little overhead as possible. My ideas below about introducing a level of indirection are not a good idea, because it's really cheap to load all the data and store it back out.

For large matrices:

If you leave room at the end of your matrix for another row, you can just copy the first row to this extra space, and pass a pointer to what was the 2nd row.

This lets you temporarily have a different view of a matrix, but it's not a repeatable process.

If you had a big buffer, you could go on rotating the matrix rows this way until you get to the end of the reserved space and have to copy the array back to the top of the buffer. This minimizes copying overhead but does mean that you're touching some new memory.

If row-copying overhead is a big deal, introducing a level of indirection might be a good idea. Depending on the access pattern of the code that uses it after you shuffle the rows, this might be worse. Instead of a normal 2D array, this might be a use case for an array of pointers to rows.

You can and should allocate the storage for the matrix with one big allocation, instead of allocating each row separately. A C++ std::vector of vectors is not ideal. Initializing your int *rows[M] just takes a loop of &A[i][0], so it's just math, not multiple loads or allocations.

Accessing the array through this indirection table replaces i*N + j math with pointer-chasing: load rows[i], then index that with j.

When you don't need the shuffled view of the array, you can access it directly, but if you want to be able to make permanent shuffles to the array, all users of it always have to go through the indirection layer.

0人赞添加讨论(0) 举报

Cache friendly matrix shift function

For large matrices:

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间