Manually vectorized code 10x slower than auto-optimized

Posted 2019-07-21 14:17

I'm trying to learn how to exploit vectorization with GCC. I followed this tutorial by Erik Holk (with source code here).

I just modified it to use double. I used this dot product to multiply randomly generated 1200x1200 square matrices of doubles (300x300 double4). I checked that the results are the same. But what really surprised me is that the simple dot product was actually 10x faster than my manually vectorized one.

Maybe double4 is too big for SSE (it would need AVX?). But I would expect that even when GCC cannot find a suitable instruction to handle a double4 at once, it would still be able to exploit the explicit information that the data come in big chunks for auto-vectorization.


Details:

The results were:

dot_simple:
time elapsed 1.90000 [s] for 1.728000e+09 evaluations => 9.094737e+08 [ops/s]

dot_SSE:
time elapsed 15.78000 [s] for 1.728000e+09 evaluations => 1.095057e+08 [ops/s]

I used GCC 4.6.3 on an Intel® Core™ i5 CPU 750 @ 2.67GHz × 4 with these options: -std=c99 -O3 -ftree-vectorize -unroll-loops --param max-unroll-times=4 -ffast-math, or with just -O2 (the result was the same).

I did it through python/scipy.weave() for convenience, but I hope that doesn't change anything.

The code:

double dot_simple(  int n, double *a, double *b ){
    double dot = 0;
    for (int i=0; i<n; i++){ 
        dot += a[i]*b[i];
    }
    return dot;
}

and this one using GCC vector extensions explicitly:

double dot_SSE(  int n, double *a, double *b ){
    const int VECTOR_SIZE = 4;
    typedef double double4 __attribute__ ((vector_size (sizeof(double) * VECTOR_SIZE)));
    double4 sum4 = {0};
    double4* a4 = (double4 *)a;
    double4* b4 = (double4 *)b;
    for (int i=0; i<n; i++){ 
        sum4 += *a4 * *b4 ;
        a4++; b4++;
        //sum4 += a4[i] * b4[i];
    }
    union {  double4 sum4_; double sum[VECTOR_SIZE]; };
    sum4_ = sum4;
    return sum[0]+sum[1]+sum[2]+sum[3];
}

Then I used it to multiply 1200x1200 random matrices (300x300 double4) to measure performance:

void mmul( int n, double* A, double* B, double* C ){
    int n4 = n*4;
    for (int i=0; i<n4; i++){
        for (int j=0; j<n4; j++){
            double* Ai = A + n4*i;
            double* Bj = B + n4*j;
            C[ i*n4 + j ] =  dot_SSE( n, Ai, Bj );
            //C[ i*n4 + j ] =  dot_simple( n4, Ai, Bj );
            ijsum++;  // evaluation counter, declared outside this snippet
        }
    }
}

The scipy.weave code:

def mmul_2(A, B, C, __force__=0 ):
    code = r'''     mmul( NA[0]/4, A, B, C );            '''
    weave_options = {
    'extra_compile_args': ['-std=c99 -O3 -ftree-vectorize -unroll-loops --param max-unroll-times=4 -ffast-math'],
    'compiler' : 'gcc', 'force' : __force__ }
    return weave.inline(code, ['A','B','C'], verbose=3, headers=['"vectortest.h"'],include_dirs=['.'], **weave_options )

1 Answer

beautiful°
answered 2019-07-21 14:51

One of the main problems is that in your function dot_SSE you loop over n items when you should only loop over n/2 items (or n/4 with AVX).

To fix this with GCC's vector extensions you can do this:

double dot_double2(int n, double *a, double *b ) {
    typedef double double2 __attribute__ ((vector_size (16)));
    double2 sum2 = {};
    int i;
    double2* a2 = (double2*)a;
    double2* b2 = (double2*)b;
    for(i=0; i<n/2; i++) {
        sum2 += a2[i]*b2[i];               // two doubles per iteration
    }
    double dot = sum2[0] + sum2[1];        // horizontal sum of the two lanes
    for(i*=2; i<n; i++) dot += a[i]*b[i];  // scalar cleanup if n is odd
    return dot;
}

The other problem with your code is that it has a dependency chain. Your CPU can do an SSE addition and an SSE multiplication simultaneously, but only on independent data paths. To fix this you need to unroll the loop. The following code unrolls the loop by 2 (but you probably need to unroll by three for the best results; a sketch of that follows the unroll-by-2 code).

double dot_double2_unroll2(int n, double *a, double *b ) {
    typedef double double2 __attribute__ ((vector_size (16)));
    double2 sum2_v1 = {};
    double2 sum2_v2 = {};
    int i;
    double2* a2 = (double2*)a;
    double2* b2 = (double2*)b;
    for(i=0; i<n/4; i++) {       
        sum2_v1 += a2[2*i+0]*b2[2*i+0];
        sum2_v2 += a2[2*i+1]*b2[2*i+1];
    }
    double dot = sum2_v1[0] + sum2_v1[1] + sum2_v2[0] + sum2_v2[1];
    for(i*=4;i<n; i++) dot +=a[i]*b[i]; 
    return dot;
}
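
For reference, here is a minimal sketch of the same idea unrolled by three, as suggested above. This is an illustrative addition of mine (the function name is hypothetical); like the other variants it relies on GCC's vector extensions and a scalar cleanup loop for the remainder:

double dot_double2_unroll3(int n, double *a, double *b ) {
    typedef double double2 __attribute__ ((vector_size (16)));
    double2 sum2_v1 = {};               // three independent accumulators
    double2 sum2_v2 = {};               // break the dependency chain on the
    double2 sum2_v3 = {};               // vector addition
    int i;
    double2* a2 = (double2*)a;
    double2* b2 = (double2*)b;
    for(i=0; i<n/6; i++) {              // 3 x double2 = 6 doubles per iteration
        sum2_v1 += a2[3*i+0]*b2[3*i+0];
        sum2_v2 += a2[3*i+1]*b2[3*i+1];
        sum2_v3 += a2[3*i+2]*b2[3*i+2];
    }
    double dot = sum2_v1[0] + sum2_v1[1]
               + sum2_v2[0] + sum2_v2[1]
               + sum2_v3[0] + sum2_v3[1];
    for(i*=6; i<n; i++) dot += a[i]*b[i];   // scalar cleanup for the remainder
    return dot;
}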

Here is a version using double4 which I think is really what you wanted with your original dot_SSE function. It's ideal for AVX (though it still needs to be unrolled) but it will still work with SSE2 as well. In fact with SSE it seems GCC breaks it into two chains which effectively unrolls the loop by 2.

double dot_double4(int n, double *a, double *b ) {
    typedef double double4 __attribute__ ((vector_size (32)));
    double4 sum4 = {};
    int i;
    double4* a4 = (double4*)a;
    double4* b4 = (double4*)b;
    for(i=0; i<n/4; i++) {       
        sum4 += a4[i]*b4[i];
    }
    double dot = sum4[0] + sum4[1] + sum4[2] + sum4[3];
    for(i*=4;i<n; i++) dot +=a[i]*b[i]; 
    return dot;
}

If you compile this with FMA enabled it will generate FMA3 instructions. I tested all these functions here (you can edit and compile the code yourself as well): http://coliru.stacked-crooked.com/a/273268902c76b116

Note that using SSE/AVX for a single dot product in matrix multiplication is not the optimal use of SIMD. You should do two (four) dot products at once with SSE (AVX) for double-precision floating point.
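
As an illustration of that last point, here is a minimal sketch (my own addition, not code from the answer) that computes two output elements of C at once, so each SSE lane accumulates an independent dot product and no horizontal sum is needed inside the inner loop. It assumes row-major n x n matrices with C = A*B and n even; the function name is hypothetical:

void mmul_two_dots_sse(int n, const double *A, const double *B, double *C) {
    typedef double double2 __attribute__ ((vector_size (16)));
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j += 2) {
            double2 acc = {};                           // lanes hold C[i][j] and C[i][j+1]
            for (int k = 0; k < n; k++) {
                double2 aik = { A[i*n+k], A[i*n+k] };   // broadcast A[i][k] to both lanes
                double2 bkj = { B[k*n+j], B[k*n+j+1] }; // B[k][j] and B[k][j+1]
                acc += aik * bkj;                       // two independent dot products advance together
            }
            C[i*n+j]   = acc[0];
            C[i*n+j+1] = acc[1];
        }
    }
}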
