How do I convince GCC to unroll a loop where the number of iterations is known, but large?
I'm compiling with -O3
.
The real code in question is more complex, of course, but here's a boiled-down example that has the same behavior:
int const constants[] = { 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144 };
int get_sum_1()
{
int total = 0;
for (int i = 0; i < CONSTANT_COUNT; ++i)
{
total += constants[i];
}
return total;
}
...if CONSTANT_COUNT
is defined as 8 (or less) then GCC will unroll the loop, propagate the constants, and reduce the entire function down to a simple return <value>;
. If, on the other hand, CONSTANT_COUNT
is 9 (or greater) then the loop is not unrolled, and GCC produces a binary which loops, reads the constants, and adds them at run-time - even though, in theory, the function could still be optimized down to just returning a constant. (Yes, I've looked at the decompiled the binary.)
If I manually unroll the loop, like this:
int get_sum_2()
{
int total = 0;
total += constants[0];
total += constants[1];
total += constants[2];
total += constants[3];
total += constants[4];
total += constants[5];
total += constants[6];
total += constants[7];
total += constants[8];
//total += constants[9];
return total;
}
Or this:
#define ADD_CONSTANT(z, v, c) total += constants[v];
int get_sum_2()
{
int total = 0;
BOOST_PP_REPEAT(CONSTANT_COUNT, ADD_CONSTANT, _)
return total;
}
...then the function is optimized down to returning a constant. So, GCC appears to be able to handle the constant propagation for larger loops, once unrolled; the hang-up appears to be just getting GCC to consider unrolling the longer loop in the first place.
However, neither hand-unrolling nor BOOST_PP_REPEAT
are viable options, because there are some cases where CONSTANT_COUNT
is a run-time expression, and the same code still needs to work correctly for those cases. (Performance isn't as critical in those cases.)
I'm working in C (not C++) so neither template meta-programming nor constexpr
are available to me.
I've tried -funroll-loops
, -funroll-all-loops
, -fpeel-loops
, and setting large values for max-unrolled-insns
, max-average-unrolled-insns
, max-unroll-times
, max-peeled-insns
, max-peel-times
, max-completely-peeled-insns
, and max-completely-peel-times
, none of which seem to make a difference.
I'm using GCC 4.8.2, on Linux, x86_64.
Any ideas? Is there a flag or parameter I'm missing...?