Forcing GCC to perform loop unswitching of memcpy

Posted 2020-08-09 05:41

Question:

Is there any reliable way to force GCC (or any compiler) to factor out runtime size checks in memcpy() outside of a loop (where that size is not compile-time constant, but constant within that loop), specializing the loop for each relevant size range rather than repeatedly checking the size within it?

This is a test case reduced down from a performance regression reported here for an open source library designed for efficient in-memory analysis of large data sets. (The regression happens to be because of one of my commits...)

The original code is in Cython, but I've reduced it down to a pure C proxy as the following:

void take(double * out, double * in,
          int stride_out_0, int stride_out_1,
          int stride_in_0, int stride_in_1,
          int * indexer, int n, int k)
{
    int i, idx, j, k_local;
    k_local = k; /* prevent aliasing */
    for(i = 0; i < n; ++i) {
        idx = indexer[i];
        for(j = 0; j < k_local; ++j)
            out[i * stride_out_0 + j * stride_out_1] =
            in[idx * stride_in_0 + j * stride_in_1];
    }
}

The strides are variable; in general the arrays are not even guaranteed to be contiguous (since they might be non-contiguous slices of larger arrays). However, for the particular case of C-contiguous arrays, I've optimized the above to the following:

#include <assert.h>
#include <string.h>

void take(double * out, double * in,
          int stride_out_0, int stride_out_1,
          int stride_in_0, int stride_in_1,
          int * indexer, int n, int k)
{
    int i, idx, k_local;
    assert(stride_out_0 == k);
    assert(stride_out_0 == stride_in_0);
    assert(stride_out_1 == 1);
    assert(stride_out_1 == stride_in_1);
    k_local = k; /* prevent aliasing */
    for(i = 0; i < n; ++i) {
        idx = indexer[i];
        memcpy(&out[i * k_local], &in[idx * k_local],
               k_local * sizeof(double));
    }
}

(The asserts are not present in the original code; instead it checks for contiguity and calls the optimized version if possible, and the unoptimized one if not.)

This version optimizes very well in most cases, since the normal use case is small n and large k. However, the opposite use case does happen as well (large n and small k), and it turns out that for the particular case of n == 10000 and k == 4 (which cannot be ruled out as representative of an important part of a hypothetical workflow), the memcpy() version is 3.6 times slower than the original. This is apparently due mainly to the fact that k is not a compile-time constant, as evidenced by the fact that this next version performs as well as the original (almost or exactly, depending on optimization settings), or sometimes better, for the particular case of k == 4:

    if (k_local == 4) {
        /* this optimizes */
        for(i = 0; i < n; ++i) {
            idx = indexer[i];
            memcpy(&out[i * k_local], &in[idx * k_local],
                   k_local * sizeof(double));
        }
    } else {
        for(i = 0; i < n; ++i) {
            idx = indexer[i];
            memcpy(&out[i * k_local], &in[idx * k_local],
                   k_local * sizeof(double));
        }
    }

Obviously, it's not practical to hardcode a loop for every particular value of k, so I attempted the following instead (as a first attempt that could later be generalized, if it worked):

    if (k_local >= 0 && k_local <= 4) {
        /* this does not optimize */
        for(i = 0; i < n; ++i) {
            idx = indexer[i];
            memcpy(&out[i * k_local], &in[idx * k_local],
                   k_local * sizeof(double));
        }
    } else {
        for(i = 0; i < n; ++i) {
            idx = indexer[i];
            memcpy(&out[i * k_local], &in[idx * k_local],
                   k_local * sizeof(double));
        }
    }

Unfortunately, this last version is no faster than the original memcpy() version, which is somewhat disheartening for my faith in GCC's optimization abilities.

Is there any way I can give extra "hints" to GCC (through any means) that will help it do the right thing here? (And even better, are there "hints" that could reliably work across different compilers? This library is compiled for many different targets.)

The results quoted are for GCC 4.6.3 on 32-bit Ubuntu with the "-O2" flag, but I've also tested GCC 4.7.2 and "-O3" versions with similar (but not identical) results. I've posted my test harness to LiveWorkspace, but the timings are from my own machine using the time(1) command (I don't know how reliable LiveWorkspace timings are.)

EDIT: I've also considered just setting a "magic number" for some minimum size to call memcpy() with, and I could find such a value with repeated testing, but I'm not sure how generalizable my results would be across different compilers/platforms. Is there any rule of thumb I could use here?

FURTHER EDIT: Realized the k_local variables are kind of useless in this case, actually, since no aliasing is possible; this was reduced from some experiments I ran where it was possible (k was global) and I forgot I changed it. Just ignore that part.

EDIT TAG: Realized I can also use C++ in newer versions of Cython, so tagging as C++ in case there's anything that can help from C++...

FINAL EDIT: In lieu (for now) of dropping down to assembly for a specialized memcpy(), the following seems to be the best empirical solution for my local machine:

    int i, idx, j;
    double * subout, * subin;
    assert(stride_out_1 == 1);
    assert(stride_out_1 == stride_in_1);
    if (k < 32 /* i.e. 256 bytes: magic! */) {
        for(i = 0; i < n; ++i) {
            idx = indexer[i];
            subout = &out[i * stride_out_0];
            subin = &in[idx * stride_in_0];
            for(j = 0; j < k; ++j)
                subout[j] = subin[j];
        }
    } else {
        for(i = 0; i < n; ++i) {
            idx = indexer[i];
            subout = &out[i * stride_out_0];
            subin = &in[idx * stride_in_0];
            memcpy(subout, subin, k * sizeof(double));
        }
    }

This uses a "magic number" to decide whether to call memcpy() or not, but still optimizes the case for small arrays that are known to be contiguous (so it's faster than the original, in most cases, since the original makes no such assumption).

Answer 1:

Ultimately, the issue at hand is one of asking the optimizer to make assumptions about runtime behavior based on multiple variables. While it is possible to give the optimizer some compile-time hints via 'const' and 'register' declarations on the key variables, in the end you're depending on it to make a lot of assumptions. Further, while memcpy() may well be an intrinsic, that's not guaranteed, and even when it is, the implementations can vary fairly widely.

If the goal is to achieve maximum performance, sometimes you just have to stop relying on the technology to figure it out for you and do it directly. The best advice for this situation is to use inline assembler to address the problem. Doing so lets you avoid all the pitfalls of a "black box" solution courtesy of the compiler's and optimizer's heuristics, and state your intent precisely. The key benefit of inline assembler is that you avoid any pushes/pops and extraneous "generalization" code in the memory-copy solution, and can take direct advantage of the processor's ability to solve the problem. The downside is maintenance, but given that you really only need to address Intel and AMD to cover most of the market, it's not insurmountable.
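To make that concrete, here is a minimal sketch of the kind of inline assembler this has in mind, assuming x86-64 and GCC-style extended asm (the helper name copy_doubles_repmovs is made up, and rep movsq is just one possible choice of copy instruction):

    #include <stddef.h>

    /* Copy n doubles (8-byte quadwords) from src to dst using rep movsq.
       x86-64 only; rdi/rsi/rcx are bound via the D/S/c constraints. */
    static inline void copy_doubles_repmovs(double *dst, const double *src, size_t n)
    {
        __asm__ volatile ("rep movsq"
                          : "+D" (dst), "+S" (src), "+c" (n)
                          :
                          : "memory");
    }

Note that the question's timings are from a 32-bit build, where the equivalent would be rep movsd with the count doubled, so treat this purely as an illustration of the approach.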

I might add, too, that this solution could well allow you to take advantage of multiple cores/threads and/or a GPU if/when available to do the copying in parallel and truly get a performance gain. While the latency might be higher, the throughput would very likely be much higher, as well. If, for example, you could take advantage of a GPU when present, you could well launch one kernel per copy and copy thousands of elements in a single operation.

The alternative is to depend on the compiler/optimizer to make the best guesses for you: use 'const' and 'register' declarations where you can to offer it hints, and use magic numbers to branch among "best solution" paths. This, however, is going to be exceptionally compiler- and system-dependent, and your mileage will vary widely from one platform/environment to another.
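As a concrete illustration of the qualifier hints mentioned above, the question's signature could be annotated roughly like this; restrict is my own addition (only const and register are mentioned above), but it is usually the more effective way to tell the compiler the buffers don't alias:

    /* C99: const marks read-only data, restrict promises no aliasing. */
    void take(double * restrict out, const double * restrict in,
              int stride_out_0, int stride_out_1,
              int stride_in_0, int stride_in_1,
              const int * restrict indexer, int n, int k);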



Answer 2:

SSE/AVX and Alignment

If you're on, for example, a modern-ish Intel processor, then using SSE or AVX instructions is an option. Whilst not specifically about GCC, see this. If you're interested and flush with cash, I think Intel does a version of their compiler suite for Linux as well as Windows, and I guess that comes with its own suite of libraries.

There's also this post.
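For a rough idea of what an SSE route looks like in plain C, here is a sketch of a per-row copy using SSE2 intrinsics (the helper name copy_row_sse2 is made up, and unaligned loads/stores are assumed since the arrays may not be 16-byte aligned):

    #include <emmintrin.h>  /* SSE2 intrinsics */

    /* Copy one row of k doubles using unaligned 16-byte loads/stores. */
    static void copy_row_sse2(double *dst, const double *src, int k)
    {
        int j = 0;
        for (; j + 2 <= k; j += 2) {
            __m128d v = _mm_loadu_pd(&src[j]);  /* load two doubles */
            _mm_storeu_pd(&dst[j], v);          /* store two doubles */
        }
        if (j < k)                              /* odd k: one element left */
            dst[j] = src[j];
    }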

Threads (eek)

I've had exactly this sort of problem fairly recently, a memcpy() taking too much time. In my instance it was one big memcpy() (1MByte or so) rather than a lot of smaller ones like you're doing.

I got very good mileage by writing my own multi-threaded memcpy(), where the threads were persistent and got 'tasked' with a share of the job by a call to my own pmemcpy() function. The persistent threads meant that the overhead was pretty low. I got a 4x improvement on 4 cores.

So if it were possible to break your loops down into a sensible number of threads (I went for one per available core), and you had the luxury of a few spare cores on your machine, you might get a similar benefit.
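I haven't shown the persistent-thread pmemcpy() here, but as a rough sketch of the simpler idea of splitting the row copies across cores, an OpenMP version of the contiguous loop might look like the following (take_omp is a made-up name; it assumes the rows written by different i never overlap, and needs -fopenmp):

    #include <string.h>

    void take_omp(double *out, const double *in,
                  const int *indexer, int n, int k)
    {
        int i;
        #pragma omp parallel for
        for (i = 0; i < n; ++i) {
            int idx = indexer[i];
            memcpy(&out[i * k], &in[idx * k], k * sizeof(double));
        }
    }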

What the real time crowd do - DMA

Just as an aside, I have the pleasure of playing around with some fairly exotic OpenVPX hardware. Basically it's a bunch of boards in a big box with a high speed serial RapidIO interconnect between them. Each board has a DMA engine that drives data across the sRIO to another board's memory.

The vendor I went to is pretty clever at how to maximise the use of a CPU. The clever bit is that the DMA engines are pretty smart - they can be programmed to do things like matrix transformations on the fly, strip mining, things like you're trying to do, etc. And because it's a separate piece of hardware the CPU isn't tied up in the meantime, so that can be busy doing something else.

For example, if you're doing something like Synthetic Aperture Radar processing you always end up doing a big matrix transform. The beauty is that the transform itself takes no CPU time at all - you just move the data to another board and it arrives already transformed.

Anyway, having the benefit of that sort of thing really makes one wish that Intel CPUs (and others) had onboard DMA engines capable of memory-to-memory work instead of just memory-to-peripheral. That would make tasks like yours really quick.



Answer 3:

I think the best way is to experiment and find the optimal "k" value at which to switch between the original algorithm (with a loop) and your optimized algorithm using memcpy. The optimal "k" will vary across different CPUs, but shouldn't be drastically different; essentially it comes down to the overhead of calling memcpy, plus the overhead inside memcpy itself of choosing the optimal algorithm (based on size, alignment, etc.), versus the "naive" loop.

memcpy is an intrinsic in GCC, yes, but it doesn't do magic. Essentially, if the size argument is known at compile time and smallish (I don't know what the threshold is), GCC will replace the call to the memcpy function with inline code. If the size argument is not known at compile time, a call to the library function memcpy will always be made.
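One way to exploit that behaviour for the loop in the question is to dispatch on the handful of small k values that matter, so that each memcpy() call sees a literal size the compiler can expand inline. This is only a sketch of the idea, reusing the variables from the question's take() function, with the macro and case list chosen arbitrarily:

    #define COPY_ROWS(K)                                    \
        for (i = 0; i < n; ++i) {                           \
            idx = indexer[i];                               \
            memcpy(&out[i * (K)], &in[idx * (K)],           \
                   (K) * sizeof(double));                   \
        }

    switch (k_local) {
    case 1:  COPY_ROWS(1); break;   /* size is a compile-time constant here */
    case 2:  COPY_ROWS(2); break;
    case 4:  COPY_ROWS(4); break;
    default: COPY_ROWS(k_local); break;  /* falls back to a runtime size */
    }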