How do I convince GCC to unroll a loop where the number of iterations is known, but large?
I'm compiling with -O3
.
The real code in question is more complex, of course, but here's a boiled-down example that has the same behavior:
int const constants[] = { 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144 };
int get_sum_1()
{
int total = 0;
for (int i = 0; i < CONSTANT_COUNT; ++i)
{
total += constants[i];
}
return total;
}
...if CONSTANT_COUNT
is defined as 8 (or less) then GCC will unroll the loop, propagate the constants, and reduce the entire function down to a simple return <value>;
. If, on the other hand, CONSTANT_COUNT
is 9 (or greater) then the loop is not unrolled, and GCC produces a binary which loops, reads the constants, and adds them at run-time - even though, in theory, the function could still be optimized down to just returning a constant. (Yes, I've looked at the decompiled the binary.)
If I manually unroll the loop, like this:
int get_sum_2()
{
int total = 0;
total += constants[0];
total += constants[1];
total += constants[2];
total += constants[3];
total += constants[4];
total += constants[5];
total += constants[6];
total += constants[7];
total += constants[8];
//total += constants[9];
return total;
}
Or this:
#define ADD_CONSTANT(z, v, c) total += constants[v];
int get_sum_2()
{
int total = 0;
BOOST_PP_REPEAT(CONSTANT_COUNT, ADD_CONSTANT, _)
return total;
}
...then the function is optimized down to returning a constant. So, GCC appears to be able to handle the constant propagation for larger loops, once unrolled; the hang-up appears to be just getting GCC to consider unrolling the longer loop in the first place.
However, neither hand-unrolling nor BOOST_PP_REPEAT
are viable options, because there are some cases where CONSTANT_COUNT
is a run-time expression, and the same code still needs to work correctly for those cases. (Performance isn't as critical in those cases.)
I'm working in C (not C++) so neither template meta-programming nor constexpr
are available to me.
I've tried -funroll-loops
, -funroll-all-loops
, -fpeel-loops
, and setting large values for max-unrolled-insns
, max-average-unrolled-insns
, max-unroll-times
, max-peeled-insns
, max-peel-times
, max-completely-peeled-insns
, and max-completely-peel-times
, none of which seem to make a difference.
I'm using GCC 4.8.2, on Linux, x86_64.
Any ideas? Is there a flag or parameter I'm missing...?
GCC has a big lot of obscure parameters and program arguments regarding loop unrolling (and optimizations). You could play with
-funroll-loops
,-funroll-all-loops
,--param
name=value, (e.g. with name beingmax-unroll-times
....) etc.Order of arguments to
gcc
matters a lot. You probably want to put-O3
first, and the weird options above after it.However, increasing the unrolling will not always improve the performance.
Last but not least, you could code your own GCC plugin which would change the unrolling criteria.
Cleverly using
__builtin_prefetch
might improve performance, see this answer (but using it carelessly will degrade performance)You need to benchmark. My feeling is that premature micro-optimization is a big lose of your time.
I'm not sure if this workaround is applicable to your actual problem but I found that GCC 4.9.0 20140604 (prerelease) on an x86_64 running Parabola GNU/Linux unrolls the following loop up to and including
CONSTANT_COUNT == 33
.I have only passed it the
-O3
flag. The assembly code forget_sum
is really justI did not try but maybe the pattern can be extended even further.
It seems odd to me that this works since – at least to my human eyes – the code looks much more complex now. Unfortunately, it is a rather intrusive way to force the unrolling. A compiler flag would have been much nicer.