How can I use SIMD to accelerate XOR of two blocks of memory?

Published 2019-02-18 13:37

Question:

I want to XOR two blocks of memory as quickly as possible. How can I use SIMD to accelerate it?

My original code is below:

void region_xor_w64(   unsigned char *r1,         /* Region 1 */
                       unsigned char *r2,         /* Region 2 */
                       int nbytes)       /* Number of bytes in region */
{
    uint64_t *l1;
    uint64_t *l2;
    uint64_t *ltop;
    unsigned char *ctop;

    ctop = r1 + nbytes;
    ltop = (uint64_t *) ctop;
    l1 = (uint64_t *) r1;
    l2 = (uint64_t *) r2;

    while (l1 < ltop) {
        *l2 = ((*l1)  ^ (*l2));
        l1++;
        l2++;
    }
}

I wrote an SSE version myself, but it gave only a small speedup:

void region_xor_sse(   unsigned char* dst,
                       unsigned char* src,
                       int block_size){
  const __m128i* wrd_ptr = (__m128i*)src;
  const __m128i* wrd_end = (__m128i*)(src+block_size);
  __m128i* dst_ptr = (__m128i*)dst;

  do{
    __m128i xmm1 = _mm_load_si128(wrd_ptr);
    __m128i xmm2 = _mm_load_si128(dst_ptr);

    xmm2 = _mm_xor_si128(xmm1, xmm2);
    _mm_store_si128(dst_ptr, xmm2);
    ++dst_ptr;
    ++wrd_ptr;
  }while(wrd_ptr < wrd_end);
}

Answer 1:

The more important question is why you would want to do this manually. Do you have an ancient compiler that you think you can outsmart? The days when you had to hand-write SIMD instructions are over. Today, in 99% of cases the compiler will do the job for you, and chances are that it will do it a lot better. Also, don't forget that new architectures come out every once in a while with ever larger instruction set extensions. So ask yourself: do you want to maintain N copies of your implementation, one per platform? Do you want to keep re-testing each one to make sure it is still worth maintaining? Most likely the answer is no.

The only thing you need to do is write the simplest possible code and let the compiler do the rest. For instance, here is how I would write your function:

void region_xor_w64(unsigned char *r1, unsigned char *r2, unsigned int len)
{
    unsigned int i;
    for (i = 0; i < len; ++i)
        r2[i] = r1[i] ^ r2[i];
}

A bit simpler, isn't it? And the compiler generates code that performs a 128-bit XOR using MOVDQU and PXOR; the critical path looks like this:

4008a0:       f3 0f 6f 04 06          movdqu xmm0,XMMWORD PTR [rsi+rax*1]
4008a5:       41 83 c0 01             add    r8d,0x1
4008a9:       f3 0f 6f 0c 07          movdqu xmm1,XMMWORD PTR [rdi+rax*1]
4008ae:       66 0f ef c1             pxor   xmm0,xmm1
4008b2:       f3 0f 7f 04 06          movdqu XMMWORD PTR [rsi+rax*1],xmm0
4008b7:       48 83 c0 10             add    rax,0x10
4008bb:       45 39 c1                cmp    r9d,r8d
4008be:       77 e0                   ja     4008a0 <region_xor_w64+0x40>

As @Mysticial has pointed out, the code above uses instructions that support unaligned access, and those are slower. If, however, the programmer can correctly assume aligned access, it is possible to tell the compiler so. For example:

void region_xor_w64(unsigned char * restrict r1,
                    unsigned char * restrict r2,
                    unsigned int len)
{
    unsigned char * restrict p1 = __builtin_assume_aligned(r1, 16);
    unsigned char * restrict p2 = __builtin_assume_aligned(r2, 16);

    unsigned int i;
    for (i = 0; i < len; ++i)
        p2[i] = p1[i] ^ p2[i];
}
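Note that `__builtin_assume_aligned` and `restrict` as used here are GCC/Clang features, and the promise is only safe if callers really do pass 16-byte-aligned buffers. A caller-side sketch (the name `alloc_region` is mine, not from the answer) that guarantees the alignment using C11 `aligned_alloc`:

```c
#include <stdlib.h>
#include <string.h>

/* Allocate a zero-filled region of at least nbytes that is 16-byte
 * aligned. C11 aligned_alloc requires the size to be a multiple of
 * the alignment, so round the size up first. */
unsigned char *alloc_region(size_t nbytes)
{
    size_t padded = (nbytes + 15) & ~(size_t)15;
    unsigned char *p = aligned_alloc(16, padded);
    if (p)
        memset(p, 0, padded);
    return p;
}
```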

The compiler generates the following for the above C code (notice movdqa):

400880:       66 0f 6f 04 06          movdqa xmm0,XMMWORD PTR [rsi+rax*1]
400885:       41 83 c0 01             add    r8d,0x1
400889:       66 0f ef 04 07          pxor   xmm0,XMMWORD PTR [rdi+rax*1]
40088e:       66 0f 7f 04 06          movdqa XMMWORD PTR [rsi+rax*1],xmm0
400893:       48 83 c0 10             add    rax,0x10
400897:       45 39 c1                cmp    r9d,r8d
40089a:       77 e4                   ja     400880 <region_xor_w64+0x20>

Tomorrow, when I buy myself a laptop with a Haswell CPU, the compiler will generate code that uses 256-bit instructions instead of 128-bit ones from the same source, giving me twice the vector performance. It would do so even if I didn't know Haswell is capable of it. You, on the other hand, would have to know about the feature, write yet another version of your code, and spend time testing it.

By the way, your implementation also has a bug: when the length is not a multiple of the vector width, the loops run one extra full-width iteration and read and write past the end of the regions (by up to 7 stray bytes in the 64-bit version and up to 15 in the SSE version). The trailing bytes need separate handling.
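One way to handle those trailing bytes (this sketch is mine, not code from the answer; `memcpy` is used for the word loads so alignment and strict aliasing are not a concern, and compilers typically turn it into plain loads and stores anyway):

```c
#include <stdint.h>
#include <string.h>

void region_xor_safe(unsigned char *r1, unsigned char *r2, size_t nbytes)
{
    size_t i = 0;

    /* Full 8-byte words. */
    for (; i + 8 <= nbytes; i += 8) {
        uint64_t a, b;
        memcpy(&a, r1 + i, 8);
        memcpy(&b, r2 + i, 8);
        b ^= a;
        memcpy(r2 + i, &b, 8);
    }

    /* Leftover tail that a full-width loop would run past. */
    for (; i < nbytes; ++i)
        r2[i] ^= r1[i];
}
```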

At any rate, I would recommend you trust your compiler and learn how to verify what it generates (i.e. get familiar with objdump). The next option is to change the compiler. Only after that should you start thinking about writing vector instructions manually. Or you're gonna have a bad time!

Hope it helps. Good Luck!



Tags: c xor simd