I want to vectorize a[i] = a[i-1] +c
by AVX2 instructions. It seems its un vectorizable because of the dependencies. I've vectorized and want to share the answer here to see if there is any better answer to this question or my solution is good.
可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效,请关闭广告屏蔽插件后再试):
问题:
回答1:
I have implemented the following function for vectorizing this and it seems OK! The speedup is 2.5x over gcc -O3
Here is the solution:
// vectorized
inline void vec(int a[LEN], int b, int c)
{
// b=1 and c=2 in this case
int i = 0;
a[i++] = b;//0 --> a[0] = 1
//step 1:
//solving dependencies vectorization factor is 8
a[i++] = a[0] + 1*c; //1 --> a[1] = 1 + 2 = 3
a[i++] = a[0] + 2*c; //2 --> a[2] = 1 + 4 = 5
a[i++] = a[0] + 3*c; //3 --> a[3] = 1 + 6 = 7
a[i++] = a[0] + 4*c; //4 --> a[4] = 1 + 8 = 9
a[i++] = a[0] + 5*c; //5 --> a[5] = 1 + 10 = 11
a[i++] = a[0] + 6*c; //6 --> a[6] = 1 + 12 = 13
a[i++] = a[0] + 7*c; //7 --> a[7] = 1 + 14 = 15
// vectorization factor reached
// 8 *c will work for all
//loading the results to an vector
__m256i dep1, dep2; // dep = { 1, 3, 5, 7, 9, 11, 13, 15 }
__m256i coeff = _mm256_set1_epi32(8*c); //coeff = { 16, 16, 16, 16, 16, 16, 16, 16 }
for(; i<LEN-1; i+=16){
dep1 = _mm256_load_si256((__m256i *) &a[i-8]);
dep1 = _mm256_add_epi32(dep1, coeff);
_mm256_store_si256((__m256i *) &a[i], dep1);
dep2 = _mm256_load_si256((__m256i *) &a[i]);
dep2 = _mm256_add_epi32(dep2, coeff);
_mm256_store_si256((__m256i *) &a[i+8], dep2);
}
}