I vectorized the following loop, that crops up in an application that I am developing:
void vecScl(Node** A, Node* B, long val){
int fact = round( dot / const);
for(i=0; i<SIZE ;i++)
(*A)->vector[i] -= fact * B->vector[i];
}
And this is the SSE code:
void vecSclSSE(Node** A, Node* B, long val){
int fact = round( dot / const);
__m128i vecPi, vecQi, vecCi, vecQCi, vecResi;
int sseBound = SIZE/4;
for(i=0,j=0; j<sseBound ; i+=4,j++){
vecPi = _mm_loadu_si128((__m128i *)&((*A)->vector)[i] );
vecQi = _mm_set_epi32(fact,fact,fact,fact);
vecCi = _mm_loadu_si128((__m128i *)&((B)->vector)[i] );
vecQCi = _mm_mullo_epi32(vecQi,vecCi);
vecResi = _mm_sub_epi32(vecPi,vecQCi);
_mm_storeu_si128((__m128i *) (((*A)->vector) + i), vecResi );
}
//Compute remaining positions if SIZE % 4 != 0
for(; i<SIZE ;i++)
(*A)->vector[i] -= q * B->vector[i];
}
While this works in terms of correctness, the performance is exactly the same with and without SSE. I am compiling the code with:
g++ *.cpp *.h -msse4.1 -march=corei7-avx -mtune=corei7-avx -mno-avx -mno-aes -Warray-bounds -O2
Is this because I am not allocating (and use the SSE functions accordingly) aligned memory? The code is very complicated to change, so I was kind of avoiding that for now.
BTW, in terms of further improvements, and considering that I am bounded to the Sandy Bridge architecture, what is the best that I can do?
EDIT: The compiler is not vectorizing the code yet. First, I changed the data types of the vectors to short
s, which doesn't change performance. Then, I compiled with -fno-tree-vectorize
and the performance is the same.
Thanks a lot
If your data is large then you may just be memory-bound, since you are doing very few ALU operations per load/store.
However there are a few minor improvements you can try:
inline void vecSclSSE(Node** A, Node* B, long val){
// make function inline, for cases where `val` is small
const int fact = (dot + const / 2 - 1) / const;
// use integer arithmetic here if possible
const __m128i vecQi = _mm_set1_epi32(fact);
// hoist constant initialisation out of loop
int32_t * const pA = (*A)->vector; // hoist invariant de-references out of loop
int32_t * const pB = B->vector;
__m128i vecPi, vecCi, vecQCi, vecResi;
for(int i = 0; i < SIZE - 3; i += 4) { // use one loop variable
vecPi = _mm_loadu_si128((__m128i *)&(pA[i]));
vecCi = _mm_loadu_si128((__m128i *)&(pB[i]));
vecQCi = _mm_mullo_epi32(vecQi,vecCi);
vecResi = _mm_sub_epi32(vecPi,vecQCi);
_mm_storeu_si128((__m128i *)&(pA[i]), vecResi);
}
//Compute remaining positions if SIZE % 4 != 0
for(; i<SIZE ;i++)
pA[i] -= q * pB[i];
}
As Paul said, you have relavively few computations per data access and your code is probably IO bound. Since unaligned stores/loads are slower than aligned ones, you really should align your data.
You should align on 16 bytes with SSE which is also a cache line, and (I think) 32 with AVX. If you allocated your data yourself, just use _aligned_alloc
. If you're using std::vector
the easiest way to align is use a custom allocator instead of std::allocator
. This allocator would call _aligned_alloc
or something like that instead of malloc
/new
. See also this question.
And then you could switch to aligned instructions for load/stores.
Also, I'm not sure what's the code generated by &((*A)->vector)[i]
, better use a local pointer to store the data, but be sure to annotate it __restrict
But before going into all this, be sure it's worth your time and the maintainance burden. You can profile with oprofile
for linux or AMD's code analyst
for windows.
I'd like to say that, for the same SIZE, I was able to vectorize a kernel that takes place right before the one in the first post. This time, I had great speedups (I won't say the factor because it is irrelevant unless I quantify the time spent with the kernel in whole application). The kernel computes the dot product of two vectors, i.e.:
for(i=0;i<SIZE;i++)
dot += A->vector[i] * B->vector[i];
From this I can conclude that the SIZE being small is not a problem. This suggests, in turn, that I might be doing wrong in the first kernel. Can somebody suggest a different set of SSE operations for the first kernel? I think it is worthwhile to try it. Next step is to allocate aligned memory but as mentioned before, this ain't crucial in Sandy Bridge and other new architectures.
This proved once again that the compiler was NOT vectorizing the code.
Thanks