How to make the following code faster

2019-04-12 11:59发布

问题:

int u1, u2;  
unsigned long elm1[20], _mulpre[16][20], res1[40], res2[40]; 64 bits long     
res1, res2 initialized to zero.  

l = 60;  
while (l)  
{  
    for (i = 0; i < 20; i += 2)  
    {  
        u1 = (elm1[i] >> l) & 15;  
        u2 = (elm1[i + 1] >> l) & 15;

        for (k = 0; k < 20; k += 2)  
        {  
            simda = _mm_load_si128 ((__m128i *) &_mulpre[u1][k]);  
            simdb = _mm_load_si128 ((__m128i *) &res1[i + k]);  
            simdb = _mm_xor_si128  (simda, simdb);  
            _mm_store_si128 ((__m128i *)&res1[i + k], simdb);  

            simda = _mm_load_si128 ((__m128i *)&_mulpre[u2][k]);  
            simdb = _mm_load_si128 ((__m128i *)&res2[i + k]);  
            simdb = _mm_xor_si128  (simda, simdb);  
            _mm_store_si128 ((__m128i *)&res2[i + k], simdb);  
        } 
    }
    l -= 4;
    All res1, res2 values are left shifted by 4 bits.  
}

The above mentioned code is called many times in my program (profiler shows 98%).

EDIT: In the inner loop, res1[i + k] values are loaded many times for same (i + k) values. I tried with this inside the while loop, I loaded all the res1 values into simd registers (array) and use array elements inside the innermost for loop to update array elements . Once both for loops are done, I stored the array values back to the res1, re2. But computation time increases with this. Any idea where I got wrong? The idea seemed to be correct

Any suggestion to make it faster is welcome.

回答1:

Unfortunately the most obvious optimisations are probably already being done by the compiler:

  • You can pull &_mulpre[u1] and &mulpre[u2] our of the inner loop.
  • You can pull &res1[i] our of the inner loop.
  • Using different variables for the two inner operations, and reordering them, might allow for better pipelining.

Possibly swapping the outer loops would improve cache locality on elm1.



回答2:

Well, you could always call it fewer times :-)

The total input & output data looks relatively small, depending on you design and expected input it might be feasible to just cache computations or do lazy evaluation instead of up-front.



回答3:

There is very little you can do with a routine such as this, since loads and stores will be the dominant factor (you're doing 2 loads + 1 store = 4 bus cycles for a single computational instruction).



回答4:

l = 60;  
while (l)  
{  
    for (i = 0; i < 20; i += 2)  
    {  
        u1 = (elm1[i] >> l) & 15;  
        u2 = (elm1[i + 1] >> l) & 15;

        for (k = 0; k < 20; k += 2)  
        {  
            _mm_stream_si128 ((__m128i *)&res1[i + k],
                    _mm_xor_si128  (
                                    _mm_load_si128 ((__m128i *) &_mulpre[u1][k]),
                                    _mm_load_si128 ((__m128i *) &res1[i + k]
                                   ));  

            mm_stream_si128 ((__m128i *)&res2[i + k],    
                    _mm_xor_si128  (
                                    _mm_load_si128 ((__m128i *)&_mulpre[u2][k]), 
                                    _mm_load_si128 ((__m128i *)&res2[i + k])
                                   ));  
        } 
    }
    l -= 4;
    All res1, res2 values are left shifted by 4 bits.  
}
  1. Do remember your are using intrinsic, using less _128mi/_mm128 value will speed up your program.
  2. try _mm_stream_si128(), it might speed up the storing process.
  3. try prefetch