可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

int u1, u2;  
unsigned long elm1[20], _mulpre[16][20], res1[40], res2[40]; 64 bits long     
res1, res2 initialized to zero.  

l = 60;  
while (l)  
{  
    for (i = 0; i < 20; i += 2)  
    {  
        u1 = (elm1[i] >> l) & 15;  
        u2 = (elm1[i + 1] >> l) & 15;

        for (k = 0; k < 20; k += 2)  
        {  
            simda = _mm_load_si128 ((__m128i *) &_mulpre[u1][k]);  
            simdb = _mm_load_si128 ((__m128i *) &res1[i + k]);  
            simdb = _mm_xor_si128  (simda, simdb);  
            _mm_store_si128 ((__m128i *)&res1[i + k], simdb);  

            simda = _mm_load_si128 ((__m128i *)&_mulpre[u2][k]);  
            simdb = _mm_load_si128 ((__m128i *)&res2[i + k]);  
            simdb = _mm_xor_si128  (simda, simdb);  
            _mm_store_si128 ((__m128i *)&res2[i + k], simdb);  
        } 
    }
    l -= 4;
    All res1, res2 values are left shifted by 4 bits.  
}

The above mentioned code is called many times in my program (profiler shows 98%).

EDIT: In the inner loop, res1[i + k] values are loaded many times for same (i + k) values. I tried with this inside the while loop, I loaded all the res1 values into simd registers (array) and use array elements inside the innermost for loop to update array elements . Once both for loops are done, I stored the array values back to the res1, re2. But computation time increases with this. Any idea where I got wrong? The idea seemed to be correct

Any suggestion to make it faster is welcome.

回答1:

Unfortunately the most obvious optimisations are probably already being done by the compiler:

You can pull &_mulpre[u1] and &mulpre[u2] our of the inner loop.
You can pull &res1[i] our of the inner loop.
Using different variables for the two inner operations, and reordering them, might allow for better pipelining.

Possibly swapping the outer loops would improve cache locality on elm1.

回答2:

Well, you could always call it fewer times :-)

The total input & output data looks relatively small, depending on you design and expected input it might be feasible to just cache computations or do lazy evaluation instead of up-front.

回答3:

There is very little you can do with a routine such as this, since loads and stores will be the dominant factor (you're doing 2 loads + 1 store = 4 bus cycles for a single computational instruction).

回答4:

l = 60;  
while (l)  
{  
    for (i = 0; i < 20; i += 2)  
    {  
        u1 = (elm1[i] >> l) & 15;  
        u2 = (elm1[i + 1] >> l) & 15;

        for (k = 0; k < 20; k += 2)  
        {  
            _mm_stream_si128 ((__m128i *)&res1[i + k],
                    _mm_xor_si128  (
                                    _mm_load_si128 ((__m128i *) &_mulpre[u1][k]),
                                    _mm_load_si128 ((__m128i *) &res1[i + k]
                                   ));  

            mm_stream_si128 ((__m128i *)&res2[i + k],    
                    _mm_xor_si128  (
                                    _mm_load_si128 ((__m128i *)&_mulpre[u2][k]), 
                                    _mm_load_si128 ((__m128i *)&res2[i + k])
                                   ));  
        } 
    }
    l -= 4;
    All res1, res2 values are left shifted by 4 bits.  
}

Do remember your are using intrinsic, using less _128mi/_mm128 value will speed up your program.
try _mm_stream_si128(), it might speed up the storing process.
try prefetch