Cache lines, false sharing and alignment

Published 2020-05-26 08:08

Question:

I wrote the following short C++ program to reproduce the false sharing effect as described by Herb Sutter:

Say we want to perform a total of WORKLOAD integer operations, distributed equally across a number (PARALLEL) of threads. For the purpose of this test, each thread increments its own dedicated variable from an array of integers, so the work should be ideally parallelizable.

#include <thread>

using std::thread;

// WORKLOAD, PARALLEL and PADDING are compile-time configuration macros;
// each thread performs its share of the total increments on its own int.
void thread_func(int* ptr)
{
    for (unsigned i = 0; i < WORKLOAD / PARALLEL; ++i)
    {
        (*ptr)++;
    }
}

int main()
{
    int arr[PARALLEL * PADDING];
    thread threads[PARALLEL];

    for (unsigned i = 0; i < PARALLEL; ++i)
    {
        threads[i] = thread(thread_func, &(arr[i * PADDING]));
    }
    for (auto& th : threads)
    {
        th.join();
    }
    return 0;
}

I think the idea is easy to grasp. If you set

#define PADDING 16

every thread will work on a separate cache line (assuming a cache line length of 64 bytes and a 4-byte int, so that 16 ints cover exactly one line). The speedup should then increase linearly until PARALLEL exceeds the number of cores. If, on the other hand, PADDING is set to any value below 16, at least two threads are likely to operate on the same cache line, which is effectively protected by a built-in hardware mutex, so one should encounter severe contention. In that case we would expect the speedup not only to be sublinear, but even to be always < 1, because of the invisible lock convoy.
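
As a quick sanity check of that arithmetic (the 4-byte int is an assumption that happens to hold for the platform named at the end of the question), the relationship can be spelled out in code:

// 16 padding ints only cover a whole 64-byte cache line if int is 4 bytes wide.
static_assert(sizeof(int) * 16 == 64,
              "PADDING of 16 assumes 4-byte int and 64-byte cache lines");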

Now, my first attempts nearly satisfied these expectations, yet the minimum value of PADDING needed to avoid false sharing was around 8 and not 16. I was quite puzzled for about half an hour until I came to the obvious conclusion that there is no guarantee that my array is aligned exactly to the beginning of a cache line in memory. The actual alignment may vary depending on many conditions, including the size of the array.

In this example, there is of course no need to have the array aligned in any special way, because we can just leave PADDING at 16 and everything works out fine. But one could imagine cases where it does make a difference whether a certain structure is aligned to a cache line or not. Hence, I added some lines of code to find out the actual alignment of my array.

int main()
{
    int arr[PARALLEL * 16];
    thread threads[PARALLEL];
    int offset = 0;

    // Advance to the first element that starts on a 64-byte boundary.
    // std::uintptr_t (from <cstdint>) keeps the cast valid on 64-bit targets, too.
    while (reinterpret_cast<std::uintptr_t>(&arr[offset]) % 64) ++offset;
    for (unsigned i = 0; i < PARALLEL; ++i)
    {
        threads[i] = thread(thread_func, &(arr[i * 16 + offset]));
    }
    for (auto& th : threads)
    {
        th.join();
    }
    return 0;
}

Although this solution worked fine for me in this case, I'm not sure whether it would be a good approach in general. So here is my question:

Is there any common way to have objects in memory aligned to cache lines other than what I did in the above example?

(using g++ MinGW Win32 x86 v.4.8.1 posix dwarf rev3)

Answer 1:

You should be able to request the required alignment from the compiler:

alignas(64) int arr[PARALLEL * PADDING]; // align the array to a 64-byte cache line
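
For completeness, here is a minimal sketch of the program from the question with this change applied. The concrete values of WORKLOAD and PARALLEL below are only illustrative assumptions; the original post does not specify them.

#include <thread>

constexpr unsigned WORKLOAD = 100000000; // assumed total number of increments
constexpr unsigned PARALLEL = 4;         // assumed thread count
constexpr unsigned PADDING  = 16;        // 16 * sizeof(int) == 64 bytes per slot

void thread_func(int* ptr)
{
    for (unsigned i = 0; i < WORKLOAD / PARALLEL; ++i)
        (*ptr)++;
}

int main()
{
    // alignas(64) puts the first element on a cache-line boundary, so the
    // manual offset search from the question is no longer necessary.
    alignas(64) int arr[PARALLEL * PADDING] = {};
    std::thread threads[PARALLEL];

    for (unsigned i = 0; i < PARALLEL; ++i)
        threads[i] = std::thread(thread_func, &arr[i * PADDING]);
    for (auto& th : threads)
        th.join();
}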


Answer 2:

gcc supports an aligned variable attribute: http://gcc.gnu.org/onlinedocs/gcc/Variable-Attributes.html

You probably want something like this:

int arr[PARALLEL * 16] __attribute__ ((aligned (8)));

This aligns arr to an eight-byte boundary.

Visual Studio has a similar feature, too: http://msdn.microsoft.com/en-us/library/83ythb65.aspx
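
Applied to the false-sharing scenario from the question, a common pattern is to wrap each thread's counter in a padded, cache-line-aligned struct instead of aligning a raw array. The following is only a sketch; the 64 (cache-line size) and the thread/iteration counts are assumptions, not values taken from the answer above.

#include <thread>

// One counter per thread, aligned (and therefore padded) so that two
// counters can never share a cache line (assumes 64-byte lines).
struct PaddedCounter
{
    int value;
} __attribute__ ((aligned (64)));   // GCC/Clang; MSVC offers __declspec(align(64))

static_assert(sizeof(PaddedCounter) == 64, "each counter occupies a full line");

int main()
{
    constexpr unsigned PARALLEL   = 4;        // assumed thread count
    constexpr unsigned ITERATIONS = 10000000; // assumed per-thread work

    PaddedCounter counters[PARALLEL] = {};
    std::thread threads[PARALLEL];

    for (unsigned i = 0; i < PARALLEL; ++i)
        threads[i] = std::thread([&counters, i] {
            for (unsigned n = 0; n < ITERATIONS; ++n)
                ++counters[i].value;
        });
    for (auto& th : threads)
        th.join();
}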



Answer 3:

In modern C++ (17 and above) you should use std::hardware_destructive_interference_size from <new> as the alignment/padding value instead of hard-coding 64. (Its counterpart, std::hardware_constructive_interference_size, is for data you want to keep on the same cache line.)
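
A minimal sketch of how that constant can replace a hard-coded 64. Note that some standard libraries shipped the constant later than C++17 (it is gated behind the __cpp_lib_hardware_interference_size feature-test macro), so the fallback value of 64 below is an assumption:

#include <cstddef>
#include <new>

// Use the standard constant where the library provides it; otherwise fall
// back to an assumed 64-byte cache line.
#ifdef __cpp_lib_hardware_interference_size
constexpr std::size_t line_size = std::hardware_destructive_interference_size;
#else
constexpr std::size_t line_size = 64;
#endif

// Each per-thread counter gets its own cache line without hard-coding 64
// at the point of use.
struct alignas(line_size) PaddedCounter
{
    int value = 0;
};

static_assert(sizeof(PaddedCounter) % line_size == 0,
              "each PaddedCounter occupies whole cache lines");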