When compiling with gcc -O3
, why does the following loop not vectorize (automatically):
#define SIZE (65536)
int a[SIZE], b[SIZE], c[SIZE];
int foo () {
int i, j;
for (i=0; i<SIZE; i++){
for (j=i; j<SIZE; j++) {
a[i] = b[i] > c[j] ? b[i] : c[j];
}
}
return a[0];
}
when the following one does?
#define SIZE (65536)
int a[SIZE], b[SIZE], c[SIZE];
int foov () {
int i, j;
for (i=0; i<SIZE; i++){
for (j=i; j<SIZE; j++) {
a[i] += b[i] > c[j] ? b[i] : c[j];
}
}
return a[0];
}
The only difference is whether the result of the expression in the inner loop is assigned to a[i], or added to a[i].
For reference -ftree-vectorizer-verbose=6
gives the following output for the first (non-vectorizing) loop.
v.c:8: note: not vectorized: inner-loop count not invariant.
v.c:9: note: Unknown alignment for access: c
v.c:9: note: Alignment of access forced using peeling.
v.c:9: note: not vectorized: live stmt not supported: D.2700_5 = c[j_20];
v.c:5: note: vectorized 0 loops in function.
And the same output for the loop that vectorizes is:
v.c:8: note: not vectorized: inner-loop count not invariant.
v.c:9: note: Unknown alignment for access: c
v.c:9: note: Alignment of access forced using peeling.
v.c:9: note: vect_model_load_cost: aligned.
v.c:9: note: vect_model_load_cost: inside_cost = 1, outside_cost = 0 .
v.c:9: note: vect_model_simple_cost: inside_cost = 1, outside_cost = 1 .
v.c:9: note: vect_model_reduction_cost: inside_cost = 1, outside_cost = 6 .
v.c:9: note: cost model: prologue peel iters set to vf/2.
v.c:9: note: cost model: epilogue peel iters set to vf/2 because peeling for alignment is unknown .
v.c:9: note: Cost model analysis:
Vector inside of loop cost: 3
Vector outside of loop cost: 27
Scalar iteration cost: 3
Scalar outside cost: 7
prologue iterations: 2
epilogue iterations: 2
Calculated minimum iters for profitability: 8
v.c:9: note: Profitability threshold = 7
v.c:9: note: Profitability threshold is 7 loop iterations.
v.c:9: note: LOOP VECTORIZED.
v.c:5: note: vectorized 1 loops in function.
The first loop is equivalent to
The problem with the original expression is that it really doesn't make that much sense, so it's not too surprising gcc isn't able to vectorize it.
In the first case: the code overwrites the same memory location
a[i]
in each iteration. This inherently sequentializes the loop as the loop iterations are not independent.(In reality, only the final iteration is actually needed. So the entire inner loop could be taken out.)
In the second case: GCC recognizes the loop as a reduction operation - for which it has special case handling to vectorize.
Compiler vectorization is often implemented as some sort of "pattern matching". Meaning the compiler analyzes code to see if it fits a certain pattern that it's able to vectorize. If it does, it gets vectorized. If it doesn't, then it doesn't.
This seems to be a corner case where the first loop doesn't fit any of the pre-coded patterns that GCC can handle. But the second case fits the "vectorizable reduction" pattern.
Here's the relevant part of GCC's source code that spits out that
"not vectorized: live stmt not supported: "
message:http://svn.open64.net/svnroot/open64/trunk/osprey-gcc-4.2.0/gcc/tree-vect-analyze.c
From just the line:
It's clear that GCC is checking to see if it matches a "vectorizable reduction" pattern.
First one just trivially changing a[] many times(temporary). Second one uses the last value of a[] each time(not temporary).
Until a version of patch, you could use "volatile" variable to vectorize.
Use
to make it aligned;
Shows "c" having different storage area than b and a.
I assume "movaps"-like instructions for being "vectorized" (from SSE-AVX instruction list)
Here: http://gcc.gnu.org/projects/tree-ssa/vectorization.html#using
the 6th and 7th examples show similar difficulties.
GCC vectorizer is probably not smart enough to vectorize the first loop. The addition case is easier to vectorize because
a + 0 == a
. ConsiderSIZE==4
:X
denotes the combinations ofi
andj
whena
will be assigned to or increased. For the case of addition, we can compute the results ofb[i] > c[j] ? b[i] : c[j]
for, say,j==1
andi==0..4
and put it into vectorD
. Then we only need to zeroD[2..3]
and add resulting vector toa[0..3]
. For the case of assignment, it is a little more trickier. We must not only zeroD[2..3]
, but also zeroA[0..1]
and only then combine the results. I guess this is where the vectorizer is failing.