Segmentation fault while working with SSE intrinsics

I am working with SSE intrinsics for the first time and I am encountering a segmentation fault even after ensuring 16-byte memory alignment. This post is a follow-up to my earlier question:

How to allocate 16byte memory aligned data

This is how I have declared my array:

  float *V = (float*) memalign(16,dx*sizeof(float));

When I try to do this:

  __m128 v_i = _mm_load_ps(&V[i]); //It works

But when I do this:

  __m128 u1 = _mm_load_ps(&V[(i-1)]); //There is a segmentation fault

But if I do:

  __m128 u1 = _mm_loadu_ps(&V[(i-1)]); //It works again

However, I want to avoid _mm_loadu_ps and make this work using _mm_load_ps only.

I am working with the Intel icc compiler.

How do I resolve this issue?

UPDATE:

I am using both operations in the following code:

  void FDTD_base (float *V, float *U, int dx, float c0, float c1, float c2, float c3, float c4)
  {
      int i, j, k;
      for (i = 4; i < dx-4; i++)
      {
          U[i] = (c0 * (V[i]) //center
                  + c1 * (V[(i-1)] + V[(i+1)] )
                  + c2 * (V[(i-2)] + V[(i+2)] )
                  + c3 * (V[(i-3)] + V[(i+3)] )
                  + c4 * (V[(i-4)] + V[(i+4)] ));
      }
  }

SSE version:

         for (i=4; i < dx-4; i+=4)
        {
            v_i = _mm_load_ps(&V[i]);
            __m128 center = _mm_mul_ps(v_i,c0_i);

            __m128 u1 = _mm_loadu_ps(&V[(i-1)]);
            u2 = _mm_loadu_ps(&V[(i+1)]);

            u3 = _mm_loadu_ps(&V[(i-2)]);
            u4 = _mm_loadu_ps(&V[(i+2)]);

            u5 = _mm_loadu_ps(&V[(i-3)]);
            u6 = _mm_loadu_ps(&V[(i+3)]);

            u7 = _mm_load_ps(&V[(i-4)]);
            u8 = _mm_load_ps(&V[(i+4)]);

            __m128 tmp1 = _mm_add_ps(u1,u2);
            __m128 tmp2 = _mm_add_ps(u3,u4);
            __m128 tmp3 = _mm_add_ps(u5,u6);
            __m128 tmp4 = _mm_add_ps(u7,u8);

            __m128 tmp5 = _mm_mul_ps(tmp1,c1_i);
            __m128 tmp6 = _mm_mul_ps(tmp2,c2_i);
            __m128 tmp7 = _mm_mul_ps(tmp3,c3_i);
            __m128 tmp8 = _mm_mul_ps(tmp4,c4_i);

            __m128 tmp9 = _mm_add_ps(tmp5,tmp6);
            __m128 tmp10 = _mm_add_ps(tmp7,tmp8);

            __m128 tmp11 = _mm_add_ps(tmp9,tmp10);
            __m128 tmp12 = _mm_add_ps(center,tmp11);

            _mm_store_ps(&U[i], tmp12);
        }

Is there a more efficient way of doing this using only _mm_load_ps() ?

Since sizeof(float) is 4, only every fourth entry in V will be properly aligned. Remember that _mm_load_ps loads four floats at a time. The argument, i.e. the pointer to the first float, needs to be aligned to 16 bytes.

I'm assuming that in your example i is a multiple of four, otherwise _mm_load_ps(&V[i]) would fail.
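To make the alignment argument concrete, here is a minimal sketch that checks which element addresses of a 16-byte-aligned float array are themselves 16-byte aligned. The helper names is_aligned16 and count_loadable_indices are made up for illustration, and the sketch uses a C11 _Alignas array rather than memalign, purely to keep it self-contained.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if p lies on a 16-byte boundary, 0 otherwise. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Hypothetical helper: counts how many of the 16 element addresses of a
   16-byte-aligned float array could legally be passed to _mm_load_ps. */
static int count_loadable_indices(void)
{
    _Alignas(16) float V[16];   /* stands in for the memalign'ed buffer */
    int count = 0;

    assert(is_aligned16(V));
    for (int i = 0; i < 16; i++)
        count += is_aligned16(&V[i]);   /* true only when i % 4 == 0 */
    return count;
}
```

Only indices 0, 4, 8 and 12 pass the check. With i a multiple of four, &V[i-1] is 12 bytes past a boundary, which is why _mm_load_ps faults there while _mm_loadu_ps does not.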

Update

This is how I would suggest implementing the above sliding window example using aligned loads and shuffles:

__m128 v_im1;
__m128 v_i = _mm_load_ps( &V[0] );
__m128 v_ip1 = _mm_load_ps( &V[4] );

for ( i = 4 ; i < dx - 4 ; i += 4 ) {   /* stop early: V[i+4..i+7] is loaded below */

    /* Get the three vectors in this 'frame'. */
    v_im1 = v_i; v_i = v_ip1; v_ip1 = _mm_load_ps( &V[i+4] );

    /* Get the u1..u8 from the example code. The selector picks elements
       2,3 of the first argument and 0,1 of the second. */
    __m128 u3 = _mm_shuffle_ps( v_im1 , v_i , 2 + (3<<2) + (0<<4) + (1<<6) );
    __m128 u4 = _mm_shuffle_ps( v_i , v_ip1 , 2 + (3<<2) + (0<<4) + (1<<6) );

    __m128 u1 = _mm_shuffle_ps( u3 , v_i , 1 + (2<<2) + (1<<4) + (2<<6) );
    __m128 u2 = _mm_shuffle_ps( v_i , u4 , 1 + (2<<2) + (1<<4) + (2<<6) );

    __m128 u5 = _mm_shuffle_ps( v_im1 , u3 , 1 + (2<<2) + (1<<4) + (2<<6) );
    __m128 u6 = _mm_shuffle_ps( u4 , v_ip1 , 1 + (2<<2) + (1<<4) + (2<<6) );

    __m128 u7 = v_im1;
    __m128 u8 = v_ip1;

    /* Do your computation and store. */
    ...

    }

Note that this is a bit tricky since _mm_shuffle_ps can only take two values from each argument, which is why we first need to make u3 and u4 in order to re-use them for the other values with different overlaps.

Note also that the values u1, u3, and u5 can be recovered from the previous iteration, where they were computed as u6, u4 and u2 respectively.

Note, finally, that I have not verified the above code! Read the documentation for _mm_shuffle_ps and check that the third argument, the selector, is correct for each case.
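As a way of checking the selectors, the following sketch puts the sliding-window idea into a complete function and compares it against the scalar loop from the question. It assumes that V and U are 16-byte aligned and that dx is a multiple of 4. The names stencil_scalar, stencil_sse and stencil_mismatches are invented for this example, and the coefficients are passed as an array instead of five separate arguments. It is one possible arrangement, not a tuned implementation.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE: __m128, _mm_load_ps, _mm_shuffle_ps, ... */

/* Scalar reference: same arithmetic as FDTD_base in the question. */
static void stencil_scalar(const float *V, float *U, int dx, const float c[5])
{
    for (int i = 4; i < dx - 4; i++)
        U[i] = c[0] * V[i]
             + c[1] * (V[i-1] + V[i+1])
             + c[2] * (V[i-2] + V[i+2])
             + c[3] * (V[i-3] + V[i+3])
             + c[4] * (V[i-4] + V[i+4]);
}

/* Aligned loads plus shuffles. Assumes V and U are 16-byte aligned and
   dx is a multiple of 4 (assumptions of this sketch). */
static void stencil_sse(const float *V, float *U, int dx, const float c[5])
{
    assert(dx % 4 == 0 && dx >= 8);

    const __m128 c0 = _mm_set1_ps(c[0]), c1 = _mm_set1_ps(c[1]),
                 c2 = _mm_set1_ps(c[2]), c3 = _mm_set1_ps(c[3]),
                 c4 = _mm_set1_ps(c[4]);
    __m128 v_im1, v_i = _mm_load_ps(&V[0]), v_ip1 = _mm_load_ps(&V[4]);

    for (int i = 4; i < dx - 4; i += 4) {
        /* Slide the three-vector frame: V[i-4..i-1], V[i..i+3], V[i+4..i+7]. */
        v_im1 = v_i; v_i = v_ip1; v_ip1 = _mm_load_ps(&V[i + 4]);

        /* _MM_SHUFFLE(d,c,b,a) yields { x[a], x[b], y[c], y[d] }. */
        __m128 u3 = _mm_shuffle_ps(v_im1, v_i, _MM_SHUFFLE(1, 0, 3, 2)); /* V[i-2..i+1] */
        __m128 u4 = _mm_shuffle_ps(v_i, v_ip1, _MM_SHUFFLE(1, 0, 3, 2)); /* V[i+2..i+5] */
        __m128 u1 = _mm_shuffle_ps(u3, v_i, _MM_SHUFFLE(2, 1, 2, 1));    /* V[i-1..i+2] */
        __m128 u2 = _mm_shuffle_ps(v_i, u4, _MM_SHUFFLE(2, 1, 2, 1));    /* V[i+1..i+4] */
        __m128 u5 = _mm_shuffle_ps(v_im1, u3, _MM_SHUFFLE(2, 1, 2, 1));  /* V[i-3..i]   */
        __m128 u6 = _mm_shuffle_ps(u4, v_ip1, _MM_SHUFFLE(2, 1, 2, 1));  /* V[i+3..i+6] */

        __m128 sum = _mm_mul_ps(v_i, c0);
        sum = _mm_add_ps(sum, _mm_mul_ps(_mm_add_ps(u1, u2), c1));
        sum = _mm_add_ps(sum, _mm_mul_ps(_mm_add_ps(u3, u4), c2));
        sum = _mm_add_ps(sum, _mm_mul_ps(_mm_add_ps(u5, u6), c3));
        sum = _mm_add_ps(sum, _mm_mul_ps(_mm_add_ps(v_im1, v_ip1), c4));
        _mm_store_ps(&U[i], sum);
    }
}

/* Runs both versions on small integer-valued data, so every intermediate
   result is exact in float, and returns the number of differing outputs. */
static int stencil_mismatches(void)
{
    enum { N = 32 };
    _Alignas(16) float V[N], Us[N] = {0}, Uv[N] = {0};
    const float c[5] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f};
    int bad = 0;

    for (int i = 0; i < N; i++)
        V[i] = (float)((i * 7) % 11);
    stencil_scalar(V, Us, N, c);
    stencil_sse(V, Uv, N, c);
    for (int i = 0; i < N; i++)
        bad += (Us[i] != Uv[i]);
    return bad;
}
```

The comparison uses integer-valued inputs on purpose: the SSE version adds the terms in a different order than the scalar loop, so with arbitrary floats the two results may differ in the last bits even when the shuffles are right.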