Is there any reliable way to force GCC (or any compiler) to hoist the runtime size check inside memcpy() out of a loop (where the size is not a compile-time constant, but is constant within that loop), specializing the loop for each relevant size range rather than re-checking the size on every iteration?
This is a test case reduced from a performance regression reported here for an open-source library designed for efficient in-memory analysis of large data sets. (The regression happens to be caused by one of my commits...)
The original code is in Cython, but I've reduced it to the following pure C proxy:
void take(double * out, double * in,
int stride_out_0, int stride_out_1,
int stride_in_0, int stride_in_1,
int * indexer, int n, int k)
{
int i, idx, j, k_local;
k_local = k; /* prevent aliasing */
for(i = 0; i < n; ++i) {
idx = indexer[i];
for(j = 0; j < k_local; ++j)
out[i * stride_out_0 + j * stride_out_1] =
in[idx * stride_in_0 + j * stride_in_1];
}
}
The strides are variable; in general the arrays are not even guaranteed to be contiguous (since they might be non-contiguous slices of larger arrays). However, for the particular case of C-contiguous arrays, I've optimized the above to the following:
void take(double * out, double * in,
int stride_out_0, int stride_out_1,
int stride_in_0, int stride_in_1,
int * indexer, int n, int k)
{
int i, idx, k_local;
assert(stride_out_0 == k);
assert(stride_out_0 == stride_in_0);
assert(stride_out_1 == 1);
assert(stride_out_1 == stride_in_1);
k_local = k; /* prevent aliasing */
for(i = 0; i < n; ++i) {
idx = indexer[i];
memcpy(&out[i * k_local], &in[idx * k_local],
k_local * sizeof(double));
}
}
(The asserts are not present in the original code; instead it checks for contiguity and calls the optimized version if possible, and the unoptimized one if not.)
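For reference, the dispatch around these two versions looks roughly like the sketch below; take_strided() and take_contiguous() stand for the two functions above (both are just called take() in the real code), and the exact contiguity test is illustrative rather than the actual Cython check:
/* illustrative dispatch: pick the memcpy() version only when both
   arrays are C-contiguous, i.e. the row stride is k and the column
   stride is 1 */
void take_strided(double *, double *, int, int, int, int, int *, int, int);
void take_contiguous(double *, double *, int, int, int, int, int *, int, int);

void take_dispatch(double * out, double * in,
                   int stride_out_0, int stride_out_1,
                   int stride_in_0, int stride_in_1,
                   int * indexer, int n, int k)
{
    if (stride_out_0 == k && stride_in_0 == k &&
        stride_out_1 == 1 && stride_in_1 == 1)
        take_contiguous(out, in, stride_out_0, stride_out_1,
                        stride_in_0, stride_in_1, indexer, n, k);
    else
        take_strided(out, in, stride_out_0, stride_out_1,
                     stride_in_0, stride_in_1, indexer, n, k);
}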
This version optimizes very well in most cases, since the normal use case is for small n and large k. However, the opposite use case does happen as well (large n and small k), and it turns out that for the particular case of n == 10000 and k == 4 (which cannot be ruled out as representative of an important part of a hypothetical workflow), the memcpy() version is 3.6x slower than the original. This is apparently mainly due to the fact that k is not a compile-time constant, as evidenced by the fact that the next version performs (almost or exactly, depending on optimization settings) as well as the original, or sometimes better, for the particular case of k == 4:
if (k_local == 4) {
/* this optimizes */
for(i = 0; i < n; ++i) {
idx = indexer[i];
memcpy(&out[i * k_local], &in[idx * k_local],
k_local * sizeof(double));
}
} else {
for(i = 0; i < n; ++i) {
idx = indexer[i];
memcpy(&out[i * k_local], &in[idx * k_local],
k_local * sizeof(double));
}
}
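As I understand it, the k == 4 branch wins because GCC can expand a memcpy() whose size is a compile-time constant inline, so that loop effectively turns into a fixed four-element copy with no call and no per-iteration size check, i.e. roughly (my reading of the generated code, not a literal dump):
/* roughly what the specialized branch becomes after GCC expands the
   constant-size memcpy() inline */
for(i = 0; i < n; ++i) {
    idx = indexer[i];
    out[i * 4 + 0] = in[idx * 4 + 0];
    out[i * 4 + 1] = in[idx * 4 + 1];
    out[i * 4 + 2] = in[idx * 4 + 2];
    out[i * 4 + 3] = in[idx * 4 + 3];
}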
Obviously, it's not practical to hardcode a loop for every particular value of k, so I attempted the following instead (as a first attempt that could later be generalized, if it worked):
if (k_local >= 0 && k_local <= 4) {
/* this does not optimize */
for(i = 0; i < n; ++i) {
idx = indexer[i];
memcpy(&out[i * k_local], &in[idx * k_local],
k_local * sizeof(double));
}
} else {
for(i = 0; i < n; ++i) {
idx = indexer[i];
memcpy(&out[i * k_local], &in[idx * k_local],
k_local * sizeof(double));
}
}
Unfortunately, this last version is no faster than the original memcpy()
version, which is somewhat disheartening for my faith in GCC's optimization abilities.
Is there any way I can give extra "hints" to GCC (through any means) that will help it do the right thing here? (And even better, are there "hints" that could reliably work across different compilers? This library is compiled for many different targets.)
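To make the question concrete, the kind of hand-written "hint" I can imagine (but would rather not maintain) is a dispatch that pins the copy size to a compile-time constant for a handful of small values of k, something like this sketch (the cutoff at 4 is arbitrary):
/* hand-specialized sizes: each case gives memcpy() a compile-time
   constant; the default keeps the generic runtime-sized copy */
#define TAKE_LOOP(K)                                      \
    for(i = 0; i < n; ++i) {                              \
        idx = indexer[i];                                 \
        memcpy(&out[i * (K)], &in[idx * (K)],             \
               (K) * sizeof(double));                     \
    }
switch (k_local) {
case 1:  TAKE_LOOP(1);       break;
case 2:  TAKE_LOOP(2);       break;
case 3:  TAKE_LOOP(3);       break;
case 4:  TAKE_LOOP(4);       break;
default: TAKE_LOOP(k_local); break;
}
#undef TAKE_LOOP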
The results quoted are for GCC 4.6.3 on 32-bit Ubuntu with the "-O2" flag, but I've also tested with GCC 4.7.2 and with "-O3", with similar (but not identical) results. I've posted my test harness to LiveWorkspace, but the timings are from my own machine using the time(1) command (I don't know how reliable LiveWorkspace timings are).
EDIT: I've also considered just setting a "magic number" for some minimum size to call memcpy()
with, and I could find such a value with repeated testing, but I'm not sure how generalizable my results would be across different compilers/platforms. Is there any rule of thumb I could use here?
FURTHER EDIT: I've realized the k_local variables are actually useless here, since no aliasing is possible; this was reduced from some experiments I ran where aliasing was possible (k was global) and I forgot I had changed it. Just ignore that part.
EDIT TAG: I've realized I can also use C++ in newer versions of Cython, so I'm tagging this as C++ in case there's anything from C++ that could help here...
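For instance, one C++-only mechanism I could imagine (untested sketch, with made-up names) is making k a template parameter so the memcpy() size becomes a compile-time constant, and dispatching at runtime for a few small values:
#include <cstring>

// untested sketch: K is a compile-time constant inside the template,
// so the memcpy() size is known to the compiler
template <int K>
static void take_rows_fixed(double * out, double * in,
                            int * indexer, int n)
{
    for (int i = 0; i < n; ++i) {
        int idx = indexer[i];
        std::memcpy(&out[i * K], &in[idx * K], K * sizeof(double));
    }
}

static void take_rows(double * out, double * in,
                      int * indexer, int n, int k)
{
    switch (k) {
    case 1: take_rows_fixed<1>(out, in, indexer, n); break;
    case 2: take_rows_fixed<2>(out, in, indexer, n); break;
    case 3: take_rows_fixed<3>(out, in, indexer, n); break;
    case 4: take_rows_fixed<4>(out, in, indexer, n); break;
    default:
        // fall back to the generic runtime-sized copy
        for (int i = 0; i < n; ++i) {
            int idx = indexer[i];
            std::memcpy(&out[i * k], &in[idx * k], k * sizeof(double));
        }
        break;
    }
}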
FINAL EDIT: In lieu (for now) of dropping down to assembly for a specialized memcpy()
, the following seems to be the best empirical solution for my local machine:
int i, idx, j;
double * subout, * subin;
assert(stride_out_1 == 1);
assert(stride_out_1 == stride_in_1);
if (k < 32 /* i.e. 256 bytes: magic! */) {
for(i = 0; i < n; ++i) {
idx = indexer[i];
subout = &out[i * stride_out_0];
subin = &in[idx * stride_in_0];
for(j = 0; j < k; ++j)
subout[j] = subin[j];
}
} else {
for(i = 0; i < n; ++i) {
idx = indexer[i];
subout = &out[i * stride_out_0];
subin = &in[idx * stride_in_0];
memcpy(subout, subin, k * sizeof(double));
}
}
This uses a "magic number" to decide whether to call memcpy()
or not, but still optimizes the case for small arrays that are known to be contiguous (so it's faster than the original, in most cases, since the original makes no such assumption).