Why does gcc autovectorization not work on a convolution matrix?

Posted 2019-02-12 17:28

Question:

I've implemented the following program for convolution with a convolution matrix:

#include <stdio.h>
#include <time.h>

#define NUM_LOOP 1000
#define N 128   //input or output dimension 1
#define M N     //input or output dimension 2
#define P 5     //convolution matrix dimension 1; for a 3x3 convolution matrix this must be 3
#define Q P     //convolution matrix dimension 2
#define Csize (P*Q)
#define Cdiv  1     //divisor for the filter
#define Coffset 0   //offset

//functions
void unusual(); //unusual implementation of convolution
void naive();
//data
unsigned short int input[N][M] __attribute__(( aligned(32))); // input data
unsigned short int output[N][M] __attribute__(( aligned(32))); // output data
unsigned short int kernel[P][Q] __attribute__(( aligned(32)));//convolution coefficients

int main(){
    struct timespec tStart, tEnd;//used to record the processing time
    double tTotal , tBest=10000;//minimum of total time will be assigned to the best time

    int w=0;
    do{// this loop repeats the body to record the best time
        clock_gettime(CLOCK_MONOTONIC,&tStart);

        //function to be executed here :

        unusual();

        clock_gettime(CLOCK_MONOTONIC,&tEnd);
        tTotal = (tEnd.tv_sec - tStart.tv_sec);
        tTotal += (tEnd.tv_nsec - tStart.tv_nsec) / 1000000000.0;

        if(tTotal<tBest)
            tBest=tTotal;
    } while(w++ < NUM_LOOP);

    printf(" The best time: %lf sec in %d repetition for %dX%d matrix\n",tBest,w, MAX1, MAX2);

    return 0;
}

//unusual sequential convolution
void unusual(){
    int i, j,k,temp;

    for (i=P/2; i< N-P/2; i++){
        for(j=Q/2; j< M-Q/2; j++){
            temp=0;
            for(k=0; k< Csize; k++){
                temp += (kernel[k/Q][k%Q]) * (input[i - (P/2) + (k/Q)][j - (Q/2) + (k%Q)]);

            }
            output[i][j]=((temp/(Cdiv))+Coffset);
        }
    }
}
//The naive implementation
inline void naive(){
    int i, j,k,l,temp;
    for (i=P/2; i< N-P/2; i++){
        for(j=Q/2; j< M-Q/2; j++){
            temp=0;

            for(k = 0; k <  P; k++){ 
                for(l = 0; l <  Q; l++){
                    temp += (kernel[k][l]) * (input[i - (P/2)+k][j - (Q/2)+l]);
                }
            }
            output[i][j]=((temp/(Cdiv))+Coffset);
        }
    }
}

The problem is that when I use -O3 for auto-vectorization, it only works for a 3x3 convolution matrix. I've looked at the assembly output: auto-vectorization only transforms the 3x3 kernel and improves performance considerably (about 20 times faster; note that the scalar version of the unusual function is slower than the naive function), but there is no improvement for a 5x5 convolution matrix.

UPDATE: I added the naive implementation to the question and, for clarification, changed the picture size to NxM, the conv matrix to kernel, Cdim1xCdim2 to PxQ, and the seqConv function to unusual. The question is not how to improve the implementation of the unusual function. The question is: while all elements are in the same places in memory, why does gcc (with its heuristics, etc.) fail to improve this unusual implementation? NOTE: the problem is not about the naive implementation. gcc -O3 improves the naive implementation by a ~7x speedup for 3x3 and 5x5 kernels, and by ~1.5x for 7x7 and 9x9. To improve the convolution I used intrinsics, and the speedup is more than 40x over the naive implementation, which is itself ~2x faster than the unusual convolution; so my vectorized version is ~80x faster than my unusual one. Hand-tuned optimization is not the problem. Auto-vectorizer optimization is the problem, and the reason it fails.

GCC command : gcc -Wall -march=native -O3 -o "%e" "%f"

Platform: Linux Mint, Skylake, gcc 6.2

Thanks in advance

Answer 1:

It seems no one is interested in answering this question, so I will share my findings and update this answer in the future.

First update: In my experience, gcc -fopt-info-vec reports vectorization only for Csize <= 16. This is because the vectorization factor is 16, and it is one of the reasons gcc does not vectorize the unusual implementation for larger kernel sizes. The vectorization factor is the number of elements that fit in one vector; here each element is a 16-bit short integer.
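
You can also ask gcc itself why it gave up by adding -fopt-info-vec-missed to the compile command. As a minimal sketch of the arithmetic behind the Csize <= 16 threshold (my own illustration, assuming the 256-bit AVX2 vectors that -march=native selects on the asker's Skylake):

#include <stdio.h>

int main(void){
    const int vf = 256 / 16;      /* 16-bit lanes per 256-bit vector => VF = 16 */
    int p;
    for (p = 3; p <= 9; p += 2){
        int csize = p * p;        /* trip count of the inner k loop for a PxP kernel */
        printf("%dx%d kernel: Csize = %2d -> %s\n", p, p, csize,
               csize <= vf ? "fits in one vector, reported as vectorized"
                           : "exceeds the vectorization factor, left scalar");
    }
    return 0;
}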

From Wikipedia:

In the first step, the compiler looks for obstacles that can prevent vectorization. A major obstacle for vectorization is true data dependency shorter than the vector length. Other obstacles include function calls and short iteration counts.



Answer 2:

My guess is that it is failing to optimize due to issues with memory alignment. You have declared the convolution data as 2-byte shorts. Most SSE instructions work with 128-bit vectors, AVX with 256-bit vectors, and AVX-512 with 512-bit vectors.

On my machine I declared conv like this:

uint16_t conv[Cdim1][8] = {0}; //You need to pad extra fields with zeroes

And later replaced the inner loops like this:

for(ki = 0; ki < Cdim1; ++ki) 
    for(kj = 0; kj < 8; ++kj)
        temp += (conv[ki][kj]) * (input[i - (Cdim1/2) + ki][j - (Cdim2/2) + kj]);

Compiling with: gcc so.c -Wall -Wextra -Ofast -mtune=native gave me vector optimizations!

Bad things:

  • Don't hard-code 8. Find the minimal required padding and make it a macro so it also works with convolution matrices of dimension >= 8.
  • Pad input with some zeroes so that the undefined behavior at the edges disappears.
  • Note that this does not actually increase your performance. In fact it runs slower!
  • Note that you can squeeze out a couple of cycles if you reorder the loops as for(ki) for(i) for(j) for(kj). This is probably due to lower register pressure, since each row of conv can stay in registers longer. It might also be a quirk of my CPU.
  • You might want to consider using __attribute__ ((aligned (8))) when declaring variables as well. In this case it didn't change anything, but it is worth keeping in mind when optimising. Naturally this will only work on GCC, and you will need other mechanisms for MSVC.
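
For reference, here is the padding idea above recast in the question's current naming (P, Q, kernel); this is only a sketch of the approach, not code from the answer. kernel_pad is a hypothetical zero-padded copy of kernel, and, as the caveats above note, input would also need extra zero columns for the fixed-width inner loop to be safe at the right edge:

unsigned short kernel_pad[P][8] __attribute__(( aligned(32))); // rows padded to 8 columns; entries Q..7 stay zero

void padded(){
    int i, j, k, l, temp;
    for (i = P/2; i < N - P/2; i++){
        for (j = Q/2; j < M - Q/2; j++){
            temp = 0;
            for (k = 0; k < P; k++)
                for (l = 0; l < 8; l++)   // fixed inner trip count of 8 is what the vectorizer likes
                    temp += kernel_pad[k][l] * input[i - P/2 + k][j - Q/2 + l];
            output[i][j] = temp/Cdiv + Coffset;
        }
    }
}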


Answer 3:

The main obstacle for the auto-vectorizer is a non-constant loop bound. In your implementation, if you used int Csize = P*Q; it would not be vectorized, so to help the auto-vectorizer you should keep the bound a compile-time constant. That is not the problem here, because you declared Csize with #define, but keep it in mind for future work. Your unusual implementation is essentially a loop transformation (loop collapsing) of the naive implementation, which is an optimization compilers themselves perform, and by collapsing the loops you seem to have made the naive implementation harder for the vectorizer. Your finding says it is restricted by the vectorization factor of 16, so I unrolled your unusual function and the auto-vectorizer reports that it has been vectorized:

for(k=0; k< P*Q; k+=2){
    temp += (kernel[k/Q][k%Q])         * (input[i - (P/2) + (k/Q)][j - (Q/2) + (k%Q)]);
    temp += (kernel[(k+1)/Q][(k+1)%Q]) * (input[i - (P/2) + ((k+1)/Q)][j - (Q/2) + ((k+1)%Q)]);
}

It also works for a 7x7 kernel:

for(k=0; k< P*Q; k+=4){//IACA_START
    temp += (kernel[k/Q][k%Q])         * (input[i - (P/2) + (k/Q)][j - (Q/2) + (k%Q)]);
    temp += (kernel[(k+1)/Q][(k+1)%Q]) * (input[i - (P/2) + ((k+1)/Q)][j - (Q/2) + ((k+1)%Q)]);
    temp += (kernel[(k+2)/Q][(k+2)%Q]) * (input[i - (P/2) + ((k+2)/Q)][j - (Q/2) + ((k+2)%Q)]);
    temp += (kernel[(k+3)/Q][(k+3)%Q]) * (input[i - (P/2) + ((k+3)/Q)][j - (Q/2) + ((k+3)%Q)]);
}

You don't need to unroll it by yourself; you might be able to force the compiler to unroll or change the loop structure with #pragma directives. This works because of the SLP (superword-level parallelism) approach that compilers use for auto-vectorizing, and interestingly, SLP is based on unrolling.
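
For example, a minimal sketch (mine, not from the answer) of letting the compiler handle the unrolling is to mark the inner reduction of the unusual function explicitly; this needs -fopenmp-simd (or -fopenmp) added to the compile flags, and gcc's cost model may still decide to keep it scalar:

for (i = P/2; i < N - P/2; i++){
    for (j = Q/2; j < M - Q/2; j++){
        temp = 0;
        #pragma omp simd reduction(+:temp)   // ask gcc to vectorize the reduction over the kernel elements
        for (k = 0; k < Csize; k++)
            temp += kernel[k/Q][k%Q] * input[i - P/2 + k/Q][j - Q/2 + k%Q];
        output[i][j] = temp/Cdiv + Coffset;
    }
}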