可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

This is a follow on to this post. Disclaimer: I have done zero profiling and don't even have an application, this is purely for me to learn more about vectorization.

My code is below. I am compiling with gcc 4.9.4 on a machine with an i3 m370. The first loop vectorizes as I expect. However the second loop checking each element of temp is not vectorized AFAICT, with all the "andb" instructions. I expected it to be vectorized with something like _mm_test_all_ones. How can that loop also be vectorized? Second question, I really want this as part of a larger loop. If I uncomment whats below, nothing gets vectorized. How can I also get that vectorized?

#define ARR_LENGTH 4096
#define block_size 4
typedef float afloat __attribute__ ((__aligned__(16)));

char all_equal_2(afloat *a, afloat *b){
    unsigned int i, j;
    char r = 1;
    unsigned int temp[block_size] __attribute__((aligned(16)));
    //for (i=0; i<ARR_LENGTH; i+=block_size){

        for (j = 0; j < block_size; ++j) {
            temp[j] = (*a) == (*b);
            a++;
            b++;
        }

        for (j=0; j<block_size; j++){
            r &= temp[j];
        }

        /*if (r == 0){
            break;
        }
    }*/
    return r;
}

And the key section of resulting assembly:

.cfi_startproc
    movaps  (%rdi), %xmm0
    cmpeqps (%rsi), %xmm0
    movdqa  .LC0(%rip), %xmm1
    pand    %xmm0, %xmm1
    movaps  %xmm1, -24(%rsp)
    movl    -24(%rsp), %eax
    andl    $1, %eax
    andb    -20(%rsp), %al
    andb    -16(%rsp), %al
    andb    -12(%rsp), %al
    ret
    .cfi_endproc

Update: This post is similar to my first question. In that question, the vector was a raw pointer so segfaults are possible, but here that isn't a concern. Therefore AFAIK reordering the comparison operations is safe here, but not there. The conclusion is probably the same though.

回答1:

So your loop is quiet small and it is recursive: the result of iteration N is used as an input in iteration N+1. If you change your second loop to allow 2 operations per ieration:

        char r2 = r;
    for (j=0; j<block_size/2; j+=2){
        r &= temp[j];
        r2 &=temp[j+1];
    }
    r &= r2;

you will see output is optimized

.cfi_def_cfa_register %rbp
vmovss  (%rdi), %xmm0           ## xmm0 = mem[0],zero,zero,zero
vmovss  4(%rdi), %xmm1          ## xmm1 = mem[0],zero,zero,zero
vucomiss    (%rsi), %xmm0
sete    %al
vucomiss    4(%rsi), %xmm1
sete    %cl
andb    %al, %cl
movzbl  %cl, %eax
popq    %rbp
retq
.cfi_endproc

for the last point, with the code optimized and the outer loop enabled I see some optimizations. Did you change compilation options?

回答2:

Autovectorization really likes reductions operations, so the trick was to turn this into a reduction.

#define ARR_LENGTH 4096
typedef float afloat __attribute__ ((__aligned__(16)));
int foo(afloat *a, afloat *b){
    unsigned int i, j;
    unsigned int result;
    unsigned int blocksize = 4;
    for (i=0; i<ARR_LENGTH; i+=blocksize){
        result = 0;
        for (j=0; j<blocksize; j++){
            result += (*a) == (*b);
            a++;
            b++;
        }
        if (result == blocksize){
            blocksize *= 2;
        } else {
            break;
        }
    }
    blocksize = ARR_LENGTH - i;
    for (i=0; i<blocksize; i++){
        result += (*a) == (*b);
        a++;
        b++;
    }
    return result == i;
}

Compiles into a nice loop:

.L3:
        movaps  (%rdi,%rax), %xmm1
        addl    $1, %ecx
        cmpeqps (%rsi,%rax), %xmm1
        addq    $16, %rax
        cmpl    %r8d, %ecx
        psubd   %xmm1, %xmm0
        jb      .L3