SIMD code runs slower than scalar code

elma and elmc are both unsigned long arrays. So are res1 and res2.

unsigned long simdstore[2];  
__m128i *p, simda, simdb, simdc;  
p = (__m128i *) simdstore;  

for (i = 0; i < _polylen; i++)  
{
    u1 = (elma[i] >> l) & 15;  
    u2 = (elmc[i] >> l) & 15;  
    for (k = 0; k < 20; k++)  
    {
        //res1[i + k] ^= _mulpre1[u1][k];  
        //res2[i + k] ^= _mulpre2[u2][k];               

        simda = _mm_set_epi64x (_mulpre2[u2][k], _mulpre1[u1][k]);  
        simdb = _mm_set_epi64x (res2[i + k], res1[i + k]);  
        simdc = _mm_xor_si128 (simda, simdb);  
        _mm_store_si128 (p, simdc);  
        res1[i + k] = simdstore[0];  
        res2[i + k] = simdstore[1];                     
    }     
}

Within the for loop is included both the non-simd and simd version of XOR of elements. First two lines within the second for loop do the explicit XOR, whereas the rest implements the simd version of the same operation.

This loop is called from outside hundreds of times, so optimizing this loop will help bring down the total computation time.

The problem is simd code runs many times slower than the scalar code.

EDIT: Done partial unrolling

__m128i *p1, *p2, *p3, *p4;  
p1 = (__m128i *) simdstore1;  
p2 = (__m128i *) simdstore2;  
p3 = (__m128i *) simdstore3;  
p4 = (__m128i *) simdstore4;  

for (i = 0; i < 20; i++)  
{
    u1 = (elma[i] >> l) & 15;  
    u2 = (elmc[i] >> l) & 15;  
    for (k = 0; k < 20; k = k + 4)  
    {
        simda1  = _mm_set_epi64x (_mulpre2[u2][k], _mulpre1[u1][k]);  
        simda2  = _mm_set_epi64x (_mulpre2[u2][k + 1], _mulpre1[u1][k + 1]);  
        simda3  = _mm_set_epi64x (_mulpre2[u2][k + 2], _mulpre1[u1][k + 2]);  
        simda4  = _mm_set_epi64x (_mulpre2[u2][k + 3], _mulpre1[u1][k + 3]);  

        simdb1  = _mm_set_epi64x (res2[i + k], res1[i + k]);  
        simdb2  = _mm_set_epi64x (res2[i + k + 1], res1[i + k + 1]);  
        simdb3  = _mm_set_epi64x (res2[i + k + 2], res1[i + k + 2]);  
        simdb4  = _mm_set_epi64x (res2[i + k + 3], res1[i + k + 3]);  

        simdc1  = _mm_xor_si128 (simda1, simdb1);  
        simdc2  = _mm_xor_si128 (simda2, simdb2);  
        simdc3  = _mm_xor_si128 (simda3, simdb3);  
        simdc4  = _mm_xor_si128 (simda4, simdb4);  

        _mm_store_si128 (p1, simdc1);  
        _mm_store_si128 (p2, simdc2);  
        _mm_store_si128 (p3, simdc3);  
        _mm_store_si128 (p4, simdc4);  

        res1[i + k]= simdstore1[0];  
        res2[i + k]= simdstore1[1]; 
        res1[i + k + 1]= simdstore2[0];  
        res2[i + k + 1]= simdstore2[1];   
        res1[i + k + 2]= simdstore3[0];  
        res2[i + k + 2]= simdstore3[1]; 
        res1[i + k + 3]= simdstore4[0];  
        res2[i + k + 3]= simdstore4[1];   
    }  
}

But, the result does not change much; it still takes twice as long as scalar code.

标签： c optimization sse simd sse2

4条回答

我想做一个坏孩纸

2楼-- · 2019-02-10 18:49

I'm no SIMD expert either, but it looks like you could also benefit from prefetching the data, coupled with the derolling EboMike mentioned. Might also help if you merged res1 and res2 into one aligned array(of structs depending on what else uses it), then you don't need the extra copying, you can operate directly on it.

0人赞添加讨论(0) 举报

唯我独甜

3楼-- · 2019-02-10 18:59

You're doing very little computation here relative to the number of loads and stores that are being performed, so as a result you are seeing little benefit from SIMD. You will probably find it more useful in this case to use scalar code, particularly if you have an x86-64 CPU that you can use in 64-bit mode. This will reduce the number of loads and stores, which are currently the dominant factor in your performance.

(Note: you probably should NOT unroll the loop, especially if you're using Core 2 or newer.)

0人赞添加讨论(0) 举报

爱情/是我丢掉的垃圾

4楼-- · 2019-02-10 19:07

The way your code looks res1 and res2 seem to be totally independent vectors. Yet you mix them in the same register to xor them.

I would use different registers something like the following (The vectors must all be aligned).

__m128i x0, x1, x2, x3;  
for (i = 0; i < _polylen; i++)  
{  

    u1 = (elma[i] >> l) & 15;  
    u2 = (elmc[i] >> l) & 15;  
    for (k = 0; k < 20; k+=2)  
    {     
        //res1[i + k] ^= _mulpre1[u1][k];
        x0= _mm_load_si128(&_mulpre1[u1][k]);
        x1= _mm_load_si128(&res1[i + k]);
        x0= _mm_xor_si128 (x0, x1);
        _mm_store_si128 (&res1[i + k], x0);
        //res2[i + k] ^= _mulpre2[u2][k];               
        x2= _mm_load_si128(&_mulpre2[u2][k]);
        x3= _mm_load_si128(&res2[i + k]);
        x2= _mm_xor_si128 (x2, x3);
        _mm_store_si128 (&res2[i + k], x2);
   }     
}

Note that I am using only 4 registers. You can manually unroll to use all 8 registers in x86 or more in x86_64

0人赞添加讨论(0) 举报

淡お忘

5楼-- · 2019-02-10 19:10

Disclaimer: I come from a PowerPC background, so what I'm saying here might be complete hogwash. But you're stalling your vector pipeline since you try to access your results right away.

It is best to keep everything in your vector pipeline. As soon as you do any kind of conversion from vector to int or float, or storing the result into memory, you're stalling.

The best mode of operation when dealing with SSE or VMX is: Load, process, store. Load the data into your vector registers, do all the vector processing, then store it to memory.

I would recommend: Reserve several __m128i registers, unroll your loop several times, then store it.

EDIT: Also, if you unroll, and if you align res1 and res2 by 16 bytes, you can store your results directly in memory without going through this simdstore indirection, which is probably a LHS and another stall.

EDIT: Forgot the obvious. If your polylen is typically large, don't forget to do a data cache prefetch on every iteration.

0人赞添加讨论(0) 举报

SIMD code runs slower than scalar code

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间