可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效,请关闭广告屏蔽插件后再试):
问题:
When compiling with gcc -O3
, why does the following loop not vectorize (automatically):
#define SIZE (65536)
int a[SIZE], b[SIZE], c[SIZE];
int foo () {
int i, j;
for (i=0; i<SIZE; i++){
for (j=i; j<SIZE; j++) {
a[i] = b[i] > c[j] ? b[i] : c[j];
}
}
return a[0];
}
when the following one does?
#define SIZE (65536)
int a[SIZE], b[SIZE], c[SIZE];
int foov () {
int i, j;
for (i=0; i<SIZE; i++){
for (j=i; j<SIZE; j++) {
a[i] += b[i] > c[j] ? b[i] : c[j];
}
}
return a[0];
}
The only difference is whether the result of the expression in the inner loop is assigned to a[i], or added to a[i].
For reference -ftree-vectorizer-verbose=6
gives the following output for the first (non-vectorizing) loop.
v.c:8: note: not vectorized: inner-loop count not invariant.
v.c:9: note: Unknown alignment for access: c
v.c:9: note: Alignment of access forced using peeling.
v.c:9: note: not vectorized: live stmt not supported: D.2700_5 = c[j_20];
v.c:5: note: vectorized 0 loops in function.
And the same output for the loop that vectorizes is:
v.c:8: note: not vectorized: inner-loop count not invariant.
v.c:9: note: Unknown alignment for access: c
v.c:9: note: Alignment of access forced using peeling.
v.c:9: note: vect_model_load_cost: aligned.
v.c:9: note: vect_model_load_cost: inside_cost = 1, outside_cost = 0 .
v.c:9: note: vect_model_simple_cost: inside_cost = 1, outside_cost = 1 .
v.c:9: note: vect_model_reduction_cost: inside_cost = 1, outside_cost = 6 .
v.c:9: note: cost model: prologue peel iters set to vf/2.
v.c:9: note: cost model: epilogue peel iters set to vf/2 because peeling for alignment is unknown .
v.c:9: note: Cost model analysis:
Vector inside of loop cost: 3
Vector outside of loop cost: 27
Scalar iteration cost: 3
Scalar outside cost: 7
prologue iterations: 2
epilogue iterations: 2
Calculated minimum iters for profitability: 8
v.c:9: note: Profitability threshold = 7
v.c:9: note: Profitability threshold is 7 loop iterations.
v.c:9: note: LOOP VECTORIZED.
v.c:5: note: vectorized 1 loops in function.
回答1:
In the first case: the code overwrites the same memory location a[i]
in each iteration. This inherently sequentializes the loop as the loop iterations are not independent.
(In reality, only the final iteration is actually needed. So the entire inner loop could be taken out.)
In the second case: GCC recognizes the loop as a reduction operation - for which it has special case handling to vectorize.
Compiler vectorization is often implemented as some sort of "pattern matching". Meaning the compiler analyzes code to see if it fits a certain pattern that it's able to vectorize. If it does, it gets vectorized. If it doesn't, then it doesn't.
This seems to be a corner case where the first loop doesn't fit any of the pre-coded patterns that GCC can handle. But the second case fits the "vectorizable reduction" pattern.
Here's the relevant part of GCC's source code that spits out that "not vectorized: live stmt not supported: "
message:
http://svn.open64.net/svnroot/open64/trunk/osprey-gcc-4.2.0/gcc/tree-vect-analyze.c
if (STMT_VINFO_LIVE_P (stmt_info))
{
ok = vectorizable_reduction (stmt, NULL, NULL);
if (ok)
need_to_vectorize = true;
else
ok = vectorizable_live_operation (stmt, NULL, NULL);
if (!ok)
{
if (vect_print_dump_info (REPORT_UNVECTORIZED_LOOPS))
{
fprintf (vect_dump,
"not vectorized: live stmt not supported: ");
print_generic_expr (vect_dump, stmt, TDF_SLIM);
}
return false;
}
}
From just the line:
vectorizable_reduction (stmt, NULL, NULL);
It's clear that GCC is checking to see if it matches a "vectorizable reduction" pattern.
回答2:
GCC vectorizer is probably not smart enough to vectorize the first loop. The addition case is easier to vectorize because a + 0 == a
. Consider SIZE==4
:
0 1 2 3 i
0 X
1 X X
2 X X X
3 X X X X
j
X
denotes the combinations of i
and j
when a
will be assigned to or increased. For the case of addition, we can compute the results of b[i] > c[j] ? b[i] : c[j]
for, say, j==1
and i==0..4
and put it into vector D
. Then we only need to zero D[2..3]
and add resulting vector to a[0..3]
. For the case of assignment, it is a little more trickier. We must not only zero D[2..3]
, but also zero A[0..1]
and only then combine the results. I guess this is where the vectorizer is failing.
回答3:
The first loop is equivalent to
#define SIZE (65536)
int a[SIZE], b[SIZE], c[SIZE];
int foo () {
int i, j;
for (i=0; i<SIZE; i++){
a[i] = b[i] > c[SIZE - 1] ? b[i] : c[SIZE - 1];
}
return a[0];
}
The problem with the original expression is that it really doesn't make that much sense, so it's not too surprising gcc isn't able to vectorize it.
回答4:
First one just trivially changing a[] many times(temporary).
Second one uses the last value of a[] each time(not temporary).
Until a version of patch, you could use "volatile" variable to vectorize.
Use
int * c=malloc(sizeof(int));
to make it aligned;
v.c:9: note: Unknown alignment for access: c
Shows "c" having different storage area than b and a.
I assume "movaps"-like instructions for being "vectorized" (from SSE-AVX instruction list)
Here: http://gcc.gnu.org/projects/tree-ssa/vectorization.html#using
the 6th and 7th examples show similar difficulties.