I am new to SSE programming so I am hoping someone out there can help me. I recently implemented a function using GCC SSE intrinsics to compute the sum of an array of 32-bit integers. The code for my implementation is given below.
#include <emmintrin.h>  /* SSE2 integer intrinsics */

int ssum(const int *d, unsigned int len)
{
    static const unsigned int BLOCKSIZE = 4;
    unsigned int i, remainder;
    int output;
    __m128i xmm0, accumulator;
    __m128i *src;

    remainder = len % BLOCKSIZE;
    src = (__m128i*)d;
    /* first block initializes the accumulator (assumes len >= BLOCKSIZE) */
    accumulator = _mm_loadu_si128(src);

    output = 0;

    /* sum the bulk of the array four 32-bit integers at a time */
    for (i = BLOCKSIZE; i < len - remainder; i += BLOCKSIZE) {
        xmm0 = _mm_loadu_si128(++src);
        accumulator = _mm_add_epi32(accumulator, xmm0);
    }

    /* horizontal add: fold the four partial sums down to one 32-bit value */
    accumulator = _mm_add_epi32(accumulator, _mm_srli_si128(accumulator, 8));
    accumulator = _mm_add_epi32(accumulator, _mm_srli_si128(accumulator, 4));
    output = _mm_cvtsi128_si32(accumulator);

    /* clean up the remaining (len % 4) elements with a scalar loop */
    for (i = len - remainder; i < len; i++) {
        output += d[i];
    }

    return output;
}
As you can see, it is a fairly straightforward implementation where I sum the array four elements at a time using the extended xmm registers and then clean up at the end by adding up the remaining elements.
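For reference, the plain for loop I compare against is essentially the following (just a sketch of the baseline, not necessarily character-for-character what I benchmarked):

int sum_scalar(const int *d, unsigned int len)
{
    unsigned int i;
    int output = 0;
    for (i = 0; i < len; i++) {
        output += d[i];
    }
    return output;
}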
I then compared the performance of this SIMD implementation against just a plain for loop. The result of this experiment is available here:
SIMD vs. for-loop
As you can see, compared to the for loop, this implementation does indeed show about a 60% speedup for input sizes (meaning the length of the array) up to about 5M elements. However, for larger input sizes the performance relative to the for loop takes a dramatic dive and yields only about a 20% speedup.
I am at a loss to explain this dramatic decrease in performance. I am more or less stepping linearly through memory, so the effect of cache misses and page faults should be about the same for both implementations. What am I missing here? Is there any way we can flatten that curve out? Any thoughts would be greatly appreciated.
For large inputs, the data is outside the cache, and the code is memory bound.
For small inputs, the data fits inside the cache (i.e. L1/L2/L3 cache), and the code is computation bound.
I assume you didn't flush the cache before taking the performance measurements.
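If you want to measure the cold-cache case explicitly, one simple option is to flush the array out of the caches between timed runs with _mm_clflush (SSE2). A minimal sketch, assuming 64-byte cache lines:

#include <emmintrin.h>

/* Flush every cache line of the input array out of all cache levels
   before a timed run (the 64-byte line size is an assumption). */
static void flush_array(const int *d, unsigned int len)
{
    unsigned int i;
    for (i = 0; i < len; i += 64 / sizeof(int))
        _mm_clflush(&d[i]);
    _mm_mfence();   /* make sure the flushes have completed */
}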
The cache memory is inside the CPU, and the bandwidth between the cache memory and the ALU (or SSE) units is very high (higher bandwidth means less time spent transferring data).
Your highest-level cache (i.e. L3) is about 4MB to 8MB in size (depending on your CPU model).
Larger amounts of data must reside in DDR SDRAM, which is external RAM (outside the CPU).
The CPU is connected to the DDR SDRAM by a memory bus, which has much lower bandwidth than the cache memory.
Example:
Assume your external RAM type is Dual Channel DDR3 SDRAM 1600.
The maximum theoretical bandwidth between external RAM and CPU is about 25GB/Sec.
Reading 100MBytes of data (at 25GB/s) from the RAM to the CPU takes about 100e6 / 25e9 = 4msec.
From my experience, the utilized bandwidth is about half of the theoretical bandwidth, so the read time is closer to 8msec.
The computation time is shorter:
Assume each iteration of your loop takes about 2 CPU clocks (just an example).
Each iteration processes 16 bytes of data.
Processing 100MB therefore takes about (100e6 / 16)*2 = 12500000 clks.
Assume CPU frequency is 3GHz.
Total SSE processing time is about 12500000 / 3e9 = 4.2msec.
As you can see, reading the data from external RAM takes about twice as long as the SSE computation.
Since the data transfer and the computation occur in parallel, the total time is the maximum of 4.2msec and 8msec (i.e. 8msec).
Let's assume the loop without SSE takes twice as much computation time, so without SSE the computation time is about 8.4msec.
In the above example, the total improvement from using SSE is therefore only about 0.4msec (8.4msec vs. 8msec), because the memory transfer dominates both cases.
Note: The selected numbers are just for example purposes.
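If you want to play with the numbers, the whole back-of-the-envelope estimate fits in a few lines; the constants below are just the illustrative ones from the example above:

#include <stdio.h>

int main(void)
{
    double bytes           = 100e6;        /* 100MB of data                   */
    double utilized_bw     = 25e9 / 2;     /* ~half of the 25GB/s theoretical */
    double clocks_per_iter = 2.0;          /* assumed                         */
    double bytes_per_iter  = 16.0;         /* one 128-bit load per iteration  */
    double cpu_hz          = 3e9;

    double transfer = bytes / utilized_bw;                                  /* ~8msec   */
    double sse      = (bytes / bytes_per_iter) * clocks_per_iter / cpu_hz;  /* ~4.2msec */
    double scalar   = 2.0 * sse;           /* assume the plain loop is 2x slower */

    /* Transfer and computation overlap, so the total is the maximum of the two. */
    printf("SSE total:    %.1f msec\n", 1e3 * (transfer > sse ? transfer : sse));
    printf("Scalar total: %.1f msec\n", 1e3 * (transfer > scalar ? transfer : scalar));
    return 0;
}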
Benchmarks:
I did some benchmarks on my system.
I am using Windows 10 and Visual Studio 2010.
Benchmark test: Summing 100MBytes of data (25*1024^2 32-bit integers).
CPU:
- Intel Core i5 3550 (Ivy Bridge).
- CPU Base frequency is 3.3GHz.
- Actual Core Speed during the test: 3.6GHz (Turbo boost is enabled).
- L1 data cache size: 32KBytes.
- L2 cache size: 256KBytes (single core L2 cache size).
- L3 cache size: 6MBytes.
Memory:
- 8GB DDR3 Dual channel.
- RAM Frequency: 666MHz (1333MT/s effective, since DDR transfers data on both clock edges).
- Memory theoretical maximum bandwidth: (128 bits [dual channel] * 1333MT/s / 8) / 1024 = 20.8GBytes/Sec.
Results:
- Sum 100MB as one large chunk with SSE (data in external RAM).
  Processing time: 6.22msec
- Sum the same 1KB block 100*1024 = 102400 times with SSE (100MB in total, data stays inside the cache).
  Processing time: 3.86msec
- Sum 100MB as one large chunk without SSE (data in external RAM).
  Processing time: 8.1msec
- Sum the same 1KB block 100*1024 = 102400 times without SSE (100MB in total, data stays inside the cache).
  Processing time: 4.73msec
Utilized memory bandwidth: 100 / 6.22 ≈ 16GB/Sec (data size in MB divided by processing time in msec).
Average clocks per iteration with SSE (data in cache): (3.6e9*3.86e-3)/(25/4*1024^2) = 2.1 clks/iteration (dividing total CPU clocks by number of iterations).
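For completeness, the benchmark structure was roughly the following (a sketch only; ssum is the SSE function from the question, sum_scalar a plain-loop baseline like the one in the question, and the timer is the Windows high-resolution counter):

#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

int ssum(const int *d, unsigned int len);        /* SSE version from the question */
int sum_scalar(const int *d, unsigned int len);  /* plain for-loop baseline       */

static double elapsed_msec(LARGE_INTEGER t0, LARGE_INTEGER t1)
{
    LARGE_INTEGER freq;
    QueryPerformanceFrequency(&freq);
    return 1e3 * (double)(t1.QuadPart - t0.QuadPart) / (double)freq.QuadPart;
}

int main(void)
{
    unsigned int big   = 25 * 1024 * 1024;                 /* 100MB of 32-bit integers */
    unsigned int small = (unsigned int)(1024 / sizeof(int)); /* 1KB of 32-bit integers */
    int *data = (int*)malloc(big * sizeof(int));
    LARGE_INTEGER t0, t1;
    volatile int sink = 0;   /* keep the sums from being optimized away */
    unsigned int i;

    if (!data) return 1;
    for (i = 0; i < big; i++)
        data[i] = 1;

    /* Case 1: one large chunk - the data streams in from external RAM. */
    QueryPerformanceCounter(&t0);
    sink += ssum(data, big);
    QueryPerformanceCounter(&t1);
    printf("100MB chunk, SSE: %.2f msec\n", elapsed_msec(t0, t1));

    /* Case 2: the same 1KB block summed 102400 times - the data stays in the L1 cache. */
    QueryPerformanceCounter(&t0);
    for (i = 0; i < big / small; i++)
        sink += ssum(data, small);
    QueryPerformanceCounter(&t1);
    printf("1KB x %u, SSE:    %.2f msec\n", big / small, elapsed_msec(t0, t1));

    /* Repeat both cases with sum_scalar() to get the non-SSE numbers. */
    free(data);
    return 0;
}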