How to vectorize a[i] = a[i-1] + c with AVX2

Posted 2019-08-07 11:46

Question:

I want to vectorize a[i] = a[i-1] + c using AVX2 instructions. At first it looks unvectorizable because of the loop-carried dependency. I have vectorized it anyway and want to share my solution here, to see whether it is sound or whether there is a better answer.
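For reference, the dependency in the question can be removed algebraically: unrolling the recurrence gives a[i] = a[0] + i*c, so no element needs the previous one. The sketch below shows both forms side by side. The function names `fill_recurrence` and `fill_closed_form` are hypothetical, and it assumes the values stay within the range of `int`.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar reference: every iteration reads the result of the previous one. */
void fill_recurrence(int *a, size_t n, int b, int c)
{
    assert(a != NULL);
    if (n == 0)
        return;
    a[0] = b;
    for (size_t i = 1; i < n; i++)
        a[i] = a[i - 1] + c;
}

/* Closed form: a[i] = b + i*c. Iterations are independent of each other,
   which is what makes a SIMD version possible. */
void fill_closed_form(int *a, size_t n, int b, int c)
{
    assert(a != NULL);
    for (size_t i = 0; i < n; i++)
        a[i] = b + (int)i * c;
}
```

With b = 1 and c = 2, both functions produce 1, 3, 5, 7, ... for the same length.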

Answer 1:

I implemented the following function to vectorize this, and it appears to work. The speedup is 2.5x over gcc -O3. Here is the solution:

// vectorized
// Requires <immintrin.h>, a 32-byte aligned a (for the aligned load/store), and LEN >= 8.
inline void vec(int a[LEN], int b, int c)
{
    // b=1 and c=2 in this example
    int i = 0;
    a[i++] = b; //0 --> a[0] = 1
    // Step 1:
    // resolve the dependency for the first 8 elements (the vectorization factor is 8)
    a[i++] = a[0] + 1*c; //1  --> a[1] = 1 + 2  = 3
    a[i++] = a[0] + 2*c; //2  --> a[2] = 1 + 4  = 5
    a[i++] = a[0] + 3*c; //3  --> a[3] = 1 + 6  = 7
    a[i++] = a[0] + 4*c; //4  --> a[4] = 1 + 8  = 9
    a[i++] = a[0] + 5*c; //5  --> a[5] = 1 + 10 = 11
    a[i++] = a[0] + 6*c; //6  --> a[6] = 1 + 12 = 13
    a[i++] = a[0] + 7*c; //7  --> a[7] = 1 + 14 = 15
    // vectorization factor reached:
    // from here on, adding 8*c to the previous 8 elements gives the next 8
    // load the results into a vector
    __m256i dep1, dep2; // dep = { 1, 3, 5, 7, 9, 11, 13, 15 }
    __m256i coeff = _mm256_set1_epi32(8*c); // coeff = { 16, 16, 16, 16, 16, 16, 16, 16 }

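    // Step 2: two blocks of 8 per iteration; stop before a store would run past a[LEN-1]
    for(; i + 16 <= LEN; i += 16){

        dep1 = _mm256_load_si256((__m256i *) &a[i-8]);
        dep1 = _mm256_add_epi32(dep1, coeff);
        _mm256_store_si256((__m256i *) &a[i], dep1);

        dep2 = _mm256_add_epi32(dep1, coeff); // reuse dep1 instead of reloading a[i]
        _mm256_store_si256((__m256i *) &a[i+8], dep2);

    }
    // scalar tail for the elements the vector loop did not cover
    for(; i < LEN; i++)
        a[i] = a[i-1] + c;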
}
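
One possible variation on the function above, not the original poster's code: keep the running block of 8 values in a register instead of loading it back from memory, use unaligned stores so the array needs no special alignment, and finish with a scalar tail so any length works. This is a sketch under stated assumptions. The names `fill_scalar_from`, `fill_avx2` and `fill_recurrence_simd` are hypothetical, the `target("avx2")` attribute and `__builtin_cpu_supports` are GCC/Clang extensions on x86, and values are assumed to stay within the range of `int`.

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>

/* Scalar tail and fallback: a[i] = a[i-1] + c for i in [start, n). */
static void fill_scalar_from(int *a, size_t start, size_t n, int c)
{
    assert(start >= 1); /* a[start-1] must already hold a valid value */
    for (size_t i = start; i < n; i++)
        a[i] = a[i - 1] + c;
}

/* AVX2 body. The target attribute lets this build without -mavx2. */
__attribute__((target("avx2")))
static void fill_avx2(int *a, size_t n, int b, int c)
{
    /* cur = { b, b+c, b+2c, ..., b+7c } */
    __m256i lane = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    __m256i cur  = _mm256_add_epi32(_mm256_set1_epi32(b),
                       _mm256_mullo_epi32(lane, _mm256_set1_epi32(c)));
    __m256i step = _mm256_set1_epi32(8 * c); /* distance between blocks */

    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        _mm256_storeu_si256((__m256i *)(a + i), cur); /* unaligned store */
        cur = _mm256_add_epi32(cur, step);            /* next block of 8 */
    }
    if (i < n) {            /* fewer than 8 elements remain */
        if (i == 0) {       /* n < 8: nothing stored yet, seed a[0] */
            a[0] = b;
            i = 1;
        }
        fill_scalar_from(a, i, n, c);
    }
}

/* Dispatcher: use AVX2 when the CPU reports it, else plain scalar code. */
void fill_recurrence_simd(int *a, size_t n, int b, int c)
{
    if (__builtin_cpu_supports("avx2")) {
        fill_avx2(a, n, b, c);
    } else if (n > 0) {
        a[0] = b;
        fill_scalar_from(a, 1, n, c);
    }
}
```

Calling `fill_recurrence_simd(a, 21, 1, 2)` fills `a` with 1, 3, 5, ..., 41. Whether this beats the load/add/store version is something to measure on the target machine, since the loop is likely limited by store bandwidth either way.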