I've implemented the following program for convolution matrix
#include <stdio.h>
#include <time.h>
#define NUM_LOOP 1000
#define N 128 //input or output dimention 1
#define M N //input or output dimention 2
#define P 5 //convolution matrix dimention 1 if you want a 3x3 convolution matrix it must be 3
#define Q P //convolution matrix dimention 2
#define Csize P*Q
#define Cdiv 1 //div for filter
#define Coffset 0 //offset
//functions
void unusual(); //unusual implementation of convolution
void naive();
//data
unsigned short int input[N][M] __attribute__(( aligned(32))); // input data
unsigned short int output[N][M] __attribute__(( aligned(32))); // out put data
unsigned short int kernel[P][Q] __attribute__(( aligned(32)));//convolution coefficients
int main(){
struct timespec tStart, tEnd;//used to record the processiing time
double tTotal , tBest=10000;//minimum of toltal time will asign to the best time
int w=0;
do{// this loop repeat the body to record the best time
clock_gettime(CLOCK_MONOTONIC,&tStart);
//function to be executed here :
unusual();
clock_gettime(CLOCK_MONOTONIC,&tEnd);
tTotal = (tEnd.tv_sec - tStart.tv_sec);
tTotal += (tEnd.tv_nsec - tStart.tv_nsec) / 1000000000.0;
if(tTotal<tBest)
tBest=tTotal;
} while(w++ < NUM_LOOP);
printf(" The best time: %lf sec in %d repetition for %dX%d matrix\n",tBest,w, MAX1, MAX2);
return 0;
}
//unusual sequential convolution
void unusual(){
int i, j,k,temp;
for (i=P/2; i< N-P/2; i++){
for(j=Q/2; j< M-Q/2; j++){
temp=0;
for(k=0; k< Csize; k++){
temp += (kernel[k/P][k%Q]) * (input[i - (P/2) + (k/Q)][j - (Q/2) + (k%Q)]);
}
output[i][j]=((temp/(Cdiv))+Coffset);
}
}
}
//The naive implementation
inline void naive(){
int i, j,k,l,temp;
for (i=P/2; i< N-P/2; i++){
for(j=Q/2; j< M-Q/2; j++){
temp=0;
for(k = 0; k < P; k++){
for(l = 0; l < Q; l++){
temp += (kernel[k][l]) * (input[i - (P/2)+k][j - (Q/2)+l]);
}
}
output[i][j]=((temp/(Cdiv))+Coffset);
}
}
}
The problem is when I use -O3
for auto vectorizing, it just works for an 3x3 convolution matrix. I've seen the Assembly output and auto vectorization just make some changes for 3x3 kernel and improve the performance reasonably (20 time faster note: scalar version of unusual func is slower than naive fun) but there is no improvement for 5x5 convolution matrix
UPDATE: I added the naive implementation to the question and changed the picture size to NxM, conv matrix to kernel, Cdim1xCdim2 to PxQ, and seqConv function to unusual for clarification. The question is not to improve the implementation of the unusual function. The question is while all elements are in the same places of the memory, gcc uses heuristic, etc. why gcc fails to improve this unusual implementation?
NOTE: the problem is not about the naive implementation. gcc -O3
improve the naive implementation for 3x3, 5x5 kernels by ~7 speedup. and it also does for 7x7 and 9x9 by ~1.5 speedup. To improve the convolution I used intrinsics and speedup is more than 40x over the naive implementation which is ~ 2x faster than unusual convolution. So my vectorization is ~80x faster than my unusual one. Hand tuning optimization is not the problem. Auto-vectorizer optimization is the problem, and the reason of the fails.
GCC command : gcc -Wall -march=native -O3 -o "%e" "%f"
Platform: Linux mint, Skylake, gcc 6.2
Thanks in advance