I'm trying to get my code to auto-vectorize, but it isn't working.
int _tmain(int argc, _TCHAR* argv[])
{
    const int N = 4096;
    float x[N];
    float y[N];
    float sum = 0;

    //create random values for x and y
    for (int i = 0; i < N; i++)
    {
        x[i] = rand() >> 1;
        y[i] = rand() >> 1;
    }

    for (int i = 0; i < N; i++){
        sum += x[i] * y[i];
    }
}
Neither loop vectorizes here, but I'm really only interested in the second loop.
I'm using Visual Studio Express 2013 and compiling with the /O2 and /Qvec-report:2 options (the latter reports whether or not each loop was vectorized). When I compile, I get the following message:
--- Analyzing function: main
c:\users\...\documents\visual studio 2013\projects\intrin3\intrin3\intrin3.cpp(28) : info C5002: loop not vectorized due to reason '1200'
c:\users\...\documents\visual studio 2013\projects\intrin3\intrin3\intrin3.cpp(41) : info C5002: loop not vectorized due to reason '1305'
According to the documentation for the vectorizer messages, reason '1305' means that "the compiler can't discern proper vectorizable type information for this loop." I'm not really sure what this means. Any ideas?
After splitting the second loop into two loops:
for (int i = 0; i < N; i++){
    sumarray[i] = x[i] * y[i];
}
for (int i = 0; i < N; i++){
    sum += sumarray[i];
}
Now the first of the above loops vectorizes, but the second one does not, again with error code 1305.
Reason 1305 is reported because the value of sum is never used, so the optimizer does not bother vectorizing the loop. Simply adding printf("%f\n", sum) fixes that. But then you get a new reason code, 1105: "Loop includes a non-recognized reduction operation". To fix this you need to set /fp:fast.
The reason is that floating-point addition is not associative, while a reduction done with SIMD or MIMD (i.e. using multiple threads) reorders the additions and therefore needs them to be associative. A looser floating-point model allows that reordering, so the reduction can be vectorized.
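To make the associativity point concrete, here is a minimal sketch (the function names are hypothetical and not part of the original code). It compares a strict left-to-right reduction with a hand-written model of a 4-wide reduction that keeps four partial sums, which is roughly the regrouping a vectorizer has to do, and it shows one case where regrouping float additions changes the rounded result.

```cpp
#include <cassert>
#include <cstddef>

// Strict left-to-right reduction: the evaluation order the source code spells out.
float dot_scalar(const float* x, const float* y, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        sum += x[i] * y[i];
    return sum;
}

// Hypothetical model of a 4-wide SIMD reduction: four independent
// partial sums (one per lane) that are only combined at the end.
float dot_four_lanes(const float* x, const float* y, std::size_t n) {
    assert(n % 4 == 0);  // assumption: no remainder loop, to keep the sketch short
    float lane[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (std::size_t i = 0; i < n; i += 4)
        for (std::size_t k = 0; k < 4; ++k)
            lane[k] += x[i + k] * y[i + k];
    return (lane[0] + lane[1]) + (lane[2] + lane[3]);
}

// Returns true if (a + b) + c equals a + (b + c) for one chosen triple.
// volatile forces every intermediate to be rounded to float.
bool grouping_gives_same_sum() {
    volatile float a = 1.0e8f, b = -1.0e8f, c = 1.0f;
    volatile float ab = a + b;   // exactly 0
    volatile float bc = b + c;   // rounds back to -1.0e8f, the 1 is lost
    float left = ab + c;         // (a + b) + c
    float right = a + bc;        // a + (b + c)
    return left == right;
}
```

For small integer inputs both dot products agree exactly, but in general the two orders can round differently, which is why the compiler only regroups the sum when the floating-point model permits it.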
I just tested this with the following code: with the default /fp:precise the loop does not vectorize, and with /fp:fast it does.
#include <stdio.h>

int main() {
    const int N = 4096;
    float x[N];
    float y[N];
    float sum = 0;
    for (int i = 0; i < N; i++){
        sum += x[i] * y[i];
    }
    printf("sum %f\n", sum);
}
Regarding the loop that calls rand(): rand() is not a SIMD function, so that loop can't be vectorized. You would need a SIMD rand() function, and I don't know of one. An alternative is to pre-compute an array of random numbers and use the array instead. In any case, rand() is a horrible random number generator and is only useful for toy cases. Consider using the Mersenne Twister PRNG.
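As a rough sketch of the pre-compute idea, the inputs can be filled once with the standard library's Mersenne Twister (std::mt19937 from <random>), leaving the dot-product loop free of function calls. The helper names, the seed parameter and the nominal [0, 1) range are illustrative assumptions, and whether a particular compiler then vectorizes the reduction is not guaranteed.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper: fill a buffer up front with pseudo-random floats
// (nominally in [0, 1)) drawn from a Mersenne Twister engine.
std::vector<float> make_random_floats(std::size_t n, unsigned seed) {
    std::mt19937 gen(seed);
    std::uniform_real_distribution<float> dist(0.0f, 1.0f);
    std::vector<float> v(n);
    for (std::size_t i = 0; i < n; ++i)
        v[i] = dist(gen);
    return v;
}

// The reduction loop now only reads from plain arrays.
float dot(const std::vector<float>& x, const std::vector<float>& y) {
    assert(x.size() == y.size());
    const float* px = x.data();
    const float* py = y.data();
    float sum = 0.0f;
    for (std::size_t i = 0; i < x.size(); ++i)
        sum += px[i] * py[i];
    return sum;
}
```

Calling make_random_floats(4096, 1) and make_random_floats(4096, 2) and passing the results to dot mirrors the original program, with the random number generation kept out of the hot loop.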
One problem could be that your compiler doesn't necessarily align your stack allocations. If your compiler supports C++11, you could use:
alignas(16) float x[N];
alignas(16) float y[N];
to explicitly get 16-byte-aligned memory, which the aligned SSE load and store instructions require.
EDIT:
Even if alignment isn't the issue and your compiler is vectorizing unaligned code, you should still make this change, since unaligned SSE operations can be considerably slower than their aligned counterparts, especially on older processors.
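A minimal sketch of the alignment suggestion, assuming a compiler with C++11 alignas support: it puts the specifier in front of the declaration and checks the resulting addresses at run time. The helper names and the fill values are made up for illustration. As far as I know, Visual Studio 2013 does not implement alignas, and __declspec(align(16)) is the MSVC-specific spelling there.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if p sits on a 16-byte boundary.
bool is_aligned_16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

float aligned_dot_demo() {
    const int N = 4096;
    alignas(16) float x[N];  // request 16-byte alignment for the stack arrays
    alignas(16) float y[N];
    assert(is_aligned_16(x) && is_aligned_16(y));

    // Illustrative fill values so the result is easy to check by hand.
    for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }

    float sum = 0;
    for (int i = 0; i < N; i++) {
        sum += x[i] * y[i];
    }
    return sum;  // 4096 * 2 = 8192, exactly representable in float
}
```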