This is a follow-up to this post. Disclaimer: I have done zero profiling and don't even have an application; this is purely for me to learn more about vectorization.
My code is below. I am compiling with gcc 4.9.4 on a machine with an i3 m370. The first loop vectorizes as I expect. However, the second loop, which checks each element of temp, is not vectorized AFAICT: the assembly ANDs the elements together one at a time with "andb" instructions. I expected it to be vectorized with something like _mm_test_all_ones. How can that loop also be vectorized? Second question: I really want this as part of a larger loop. If I uncomment the outer loop and the early exit that are commented out below, nothing gets vectorized. How can I get that vectorized as well?
#define ARR_LENGTH 4096
#define block_size 4
typedef float afloat __attribute__ ((__aligned__(16)));

char all_equal_2(afloat *a, afloat *b){
    unsigned int i, j;
    char r = 1;
    unsigned int temp[block_size] __attribute__((aligned(16)));
    //for (i=0; i<ARR_LENGTH; i+=block_size){
        for (j = 0; j < block_size; ++j) {
            temp[j] = (*a) == (*b);
            a++;
            b++;
        }
        for (j=0; j<block_size; j++){
            r &= temp[j];
        }
        /*if (r == 0){
            break;
        }
    }*/
    return r;
}
And the key section of resulting assembly:
.cfi_startproc
movaps (%rdi), %xmm0
cmpeqps (%rsi), %xmm0
movdqa .LC0(%rip), %xmm1
pand %xmm0, %xmm1
movaps %xmm1, -24(%rsp)
movl -24(%rsp), %eax
andl $1, %eax
andb -20(%rsp), %al
andb -16(%rsp), %al
andb -12(%rsp), %al
ret
.cfi_endproc
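For comparison, a hand-written version of the single-block check might look like the sketch below. It is only an illustration of the reduction I was hoping the compiler would produce, not something gcc emits: `block_equal_sse` is a made-up name, both pointers are assumed to be 16-byte aligned, and it uses `_mm_movemask_ps` rather than `_mm_test_all_ones`, so it needs only SSE.

```c
#include <xmmintrin.h>  /* SSE: _mm_load_ps, _mm_cmpeq_ps, _mm_movemask_ps */

/* Hypothetical helper: returns 1 if the four floats at a equal the four
 * floats at b, 0 otherwise. Assumes a and b are 16-byte aligned. */
static int block_equal_sse(const float *a, const float *b)
{
    /* Each lane becomes all ones where a[k] == b[k], all zeros otherwise. */
    __m128 eq = _mm_cmpeq_ps(_mm_load_ps(a), _mm_load_ps(b));

    /* Collect the sign bit of each lane into the low 4 bits of an int;
     * 0xF means all four comparisons were true. */
    return _mm_movemask_ps(eq) == 0xF;
}
```

The movemask replaces the store to the stack and the chain of byte-wise ANDs with a single instruction followed by one integer compare.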
Update: This post is similar to my first question. In that question, the vector was a raw pointer, so segfaults were possible, but here that isn't a concern. Therefore, AFAIK, reordering the comparison operations is safe here but not there. The conclusion is probably the same, though.
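To make the reordering point concrete, here is a minimal sketch of the full loop written by hand with intrinsics. It rests on assumptions that are not checked in the code: both arrays hold exactly `ARR_LENGTH` floats, `ARR_LENGTH` is a multiple of 4, and both pointers are 16-byte aligned. The name `all_equal_sse` is made up for illustration. Because every 16-byte load stays inside the arrays, comparing a whole block before testing the result cannot fault.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: _mm_load_ps, _mm_cmpeq_ps, _mm_movemask_ps */

#define ARR_LENGTH 4096

/* Hypothetical function: returns 1 if all ARR_LENGTH floats match, else 0.
 * Assumes a and b are 16-byte aligned and ARR_LENGTH is a multiple of 4. */
static int all_equal_sse(const float *a, const float *b)
{
    size_t i;

    for (i = 0; i < ARR_LENGTH; i += 4) {
        /* Compare four floats at once; equal lanes become all ones. */
        __m128 eq = _mm_cmpeq_ps(_mm_load_ps(a + i), _mm_load_ps(b + i));

        /* Early exit as soon as any lane in this block differs. */
        if (_mm_movemask_ps(eq) != 0xF)
            return 0;
    }
    return 1;
}
```

The early exit is taken once per block of four elements rather than once per element, which is the behavior the commented-out `break` in my code was aiming for.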