I am attempting to optimise some loops. I have managed to do so, but I wonder whether I have only done it partially correctly. Say, for example, that I have this loop:
for(i=0;i<n;i++){
    b[i] = a[i]*2;
}
Unrolling this by a factor of 4 produces this:
int unroll = (n/4)*4;
for(i=0;i<unroll;i+=4)
{
    b[i] = a[i]*2;
    b[i+1] = a[i+1]*2;
    b[i+2] = a[i+2]*2;
    b[i+3] = a[i+3]*2;
}
for(;i<n;i++)
{
    b[i] = a[i]*2;
}
Now, is the equivalent SSE translation this:
__m128 ai_v = _mm_loadu_ps(&a[i]);
__m128 two_v = _mm_set1_ps(2);
__m128 ai2_v = _mm_mul_ps(ai_v, two_v);
_mm_storeu_ps(&b[i], ai2_v);
or is it:
__m128 ai_v = _mm_loadu_ps(&a[i]);
__m128 two_v = _mm_set1_ps(2);
__m128 ai_0_2_v = _mm_mul_ps(ai_v, two_v);
_mm_storeu_ps(&b[i], ai_0_2_v);
__m128 ai1_v = _mm_loadu_ps(&a[i+1]);
__m128 two1_v = _mm_set1_ps(2);
__m128 ai_1_2_v = _mm_mul_ps(ai1_v, two1_v);
_mm_storeu_ps(&b[i+1], ai_1_2_v);
__m128 ai2_v = _mm_loadu_ps(&a[i+2]);
__m128 two2_v = _mm_set1_ps(2);
__m128 ai_2_2_v = _mm_mul_ps(ai2_v, two2_v);
_mm_storeu_ps(&b[i+2], ai_2_2_v);
__m128 ai3_v = _mm_loadu_ps(&a[i+3]);
__m128 two3_v = _mm_set1_ps(2);
__m128 ai_3_2_v = _mm_mul_ps(ai3_v, two3_v);
_mm_storeu_ps(&b[i+3], ai_3_2_v);
I am slightly confused about the section of code:
for(;i<n;i++)
{
    b[i] = a[i]*2;
}
What does this do? Is it just there to handle the leftover elements, for example when the loop count is not divisible by the factor you choose to unroll it by? Thank you.
As usual, it is not efficient to unroll loops and try to match SSE instructions manually. Compilers can usually do it better than you. For example, the provided sample is compiled into SSE-enabled ASM automatically:
Loops can be unrolled as well; that would just make for longer code, which I do not want to paste here. You can trust me: compilers do unroll loops.
Conclusion
Manual unrolling will do you no good.
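As a rough illustration of the kind of code a compiler can vectorize on its own, here is a minimal sketch of the plain loop wrapped in a function. The function name scale_by_two and the use of C99 restrict (a promise that a and b do not overlap) are assumptions of this sketch and are not part of the question. With GCC or Clang, compiling at -O3 and inspecting the output of -S is one way to check whether SSE instructions were actually emitted; the result depends on the compiler and flags.

```c
#include <assert.h>

/* Plain scalar loop, left un-unrolled so the optimizer can decide how to
   unroll and vectorize it. `restrict` tells the compiler that a and b do
   not alias; that is an assumption made for this sketch. */
void scale_by_two(const float *restrict a, float *restrict b, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        b[i] = a[i] * 2.0f;
    }
}
```

The point of the sketch is only that the source stays as simple as the original loop; any unrolling happens in the generated code.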
The answer is the first block:
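It already processes four values at a time: a __m128 holds four packed floats, so one load, one multiply and one store cover a[i] through a[i+3].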
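To show where that block sits, here is a minimal sketch of the whole loop, assuming float arrays and a hypothetical function name scale_by_two_sse (neither is spelled out in the question). The vector loop advances four elements per iteration, and the trailing scalar loop handles the 0 to 3 elements left over when n is not a multiple of 4, which is what the for(;i<n;i++) loop in the question is for.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_loadu_ps, _mm_mul_ps, ... */

void scale_by_two_sse(const float *a, float *b, int n)
{
    assert(n >= 0);
    const __m128 two_v = _mm_set1_ps(2.0f); /* {2, 2, 2, 2}, set once outside the loop */
    int unroll = (n / 4) * 4;                /* largest multiple of 4 that is <= n */
    int i;

    /* Vector part: one iteration computes b[i..i+3] = a[i..i+3] * 2. */
    for (i = 0; i < unroll; i += 4) {
        __m128 ai_v  = _mm_loadu_ps(&a[i]);   /* unaligned load of 4 floats */
        __m128 ai2_v = _mm_mul_ps(ai_v, two_v);
        _mm_storeu_ps(&b[i], ai2_v);          /* unaligned store of 4 floats */
    }

    /* Scalar tail: the remaining n - unroll elements (at most 3). */
    for (; i < n; i++) {
        b[i] = a[i] * 2.0f;
    }
}
```

For n = 10, for example, unroll is 8: the vector loop runs for i = 0 and i = 4, and the tail loop handles i = 8 and i = 9.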
Here is the full program with the equivalent section of code commented out:
As for efficiency: it seems that the assembly on my system generates movups instructions, whereas the hand-rolled code could be made to use movaps, which should be faster.
I used the following program to do some benchmarks:
I got the following results (x86):
NO_UNROLL: 0.994 seconds, no SSE chosen by compiler
UNROLL: 3.511 seconds, uses movups
SSE_UNROLL: 3.315 seconds, uses movups
SSE_UNROLL_ALIGNED: 3.276 seconds, uses movaps
So it is clear that unrolling the loop has not helped in this case. Even ensuring that we use the more efficient movaps doesn't help much.
But I got an even stranger result when compiling to 64 bit (x64):
NO_UNROLL: 1.138 seconds, no SSE chosen by compiler
UNROLL: 1.409 seconds, no SSE chosen by compiler
SSE_UNROLL: 1.420 seconds, still no SSE chosen by compiler!
SSE_UNROLL_ALIGNED: 1.476 seconds, still no SSE chosen by compiler!
It seems MSVC sees through the proposal and generates better assembly regardless, albeit still slower than had we not tried any hand optimization at all.
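For reference, an aligned variant along the lines of SSE_UNROLL_ALIGNED might look like the sketch below. This is not the benchmark program used for the numbers above; the function name scale_by_two_sse_aligned is hypothetical, and the sketch assumes both arrays start on a 16-byte boundary, which _mm_load_ps and _mm_store_ps require. An optimizing compiler will typically translate these intrinsics to movaps rather than movups, but the exact instruction choice is up to the compiler.

```c
#include <assert.h>
#include <stdint.h>     /* uintptr_t, used only for the alignment check */
#include <xmmintrin.h>  /* _mm_load_ps, _mm_store_ps, _mm_mul_ps, _mm_set1_ps */

void scale_by_two_sse_aligned(const float *a, float *b, int n)
{
    assert(n >= 0);
    /* Precondition of this sketch: both pointers are 16-byte aligned. */
    assert(((uintptr_t)a % 16) == 0 && ((uintptr_t)b % 16) == 0);

    const __m128 two_v = _mm_set1_ps(2.0f);
    int unroll = (n / 4) * 4;
    int i;

    /* i is always a multiple of 4 floats (16 bytes), so &a[i] and &b[i]
       stay 16-byte aligned and the aligned load/store are valid. */
    for (i = 0; i < unroll; i += 4) {
        __m128 ai_v = _mm_load_ps(&a[i]);
        _mm_store_ps(&b[i], _mm_mul_ps(ai_v, two_v));
    }

    /* Scalar tail for the last n - unroll elements. */
    for (; i < n; i++) {
        b[i] = a[i] * 2.0f;
    }
}
```

Suitably aligned buffers can be obtained, for example, with _mm_malloc(n * sizeof(float), 16) and released with _mm_free, or declared with a 16-byte alignment specifier.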