Let's start by including the following:
#include <vector>
#include <random>
#include <ctime>       // time()
#include <emmintrin.h> // SSE2 intrinsics
using namespace std;
Now, suppose that one has the following three std::vector<float>:
const int N = 1048576;
vector<float> a(N);
vector<float> b(N);
vector<float> c(N);
default_random_engine randomGenerator(time(0));
uniform_real_distribution<float> diceroll(0.0f, 1.0f);
for(int i = 0; i < N; i++)
{
a[i] = diceroll(randomGenerator);
b[i] = diceroll(randomGenerator);
}
Now, assume that one needs to sum a and b element-wise and store the result in c, which in scalar form looks like the following:
for(int i=0; i<N; i++)
{
c[i] = a[i] + b[i];
}
What would be the SSE2 vectorized version of the above code, keeping in mind that the inputs are a and b as defined above (i.e. as collections of float) and the output is c (also a collection of float)?
After quite a bit of research, I was able to come up with the following:
for(int i=0; i<N; i+=4)
{
float a_toload[4] = { a[i], a[i + 1], a[i + 2], a[i + 3] };
float b_toload[4] = { b[i], b[i + 1], b[i + 2], b[i + 3] };
__m128 loaded_a = _mm_loadu_ps(a_toload);
__m128 loaded_b = _mm_loadu_ps(b_toload);
float result[4] = { 0, 0, 0, 0 };
_mm_storeu_ps(result, _mm_add_ps(loaded_a , loaded_b));
c[i] = result[0];
c[i + 1] = result[1];
c[i + 2] = result[2];
c[i + 3] = result[3];
}
However, this seems to be really cumbersome and is certainly quite inefficient: the SIMD version above is actually three times slower than the initial scalar version (measured, of course, with optimizations on, in release mode of Microsoft VS15, and after 1 million iterations, not just 12).
You don't need the intermediate arrays to load into the SSE registers; just load directly from your arrays. You could also omit the two loaded variables and fold those loads into the add, although the compiler should do that for you.

You need to be careful with this, though: it won't work correctly if the vector size is not a multiple of 4 (you'll read past the end of the arrays, resulting in undefined behavior, and the write past the end of c could be damaging).

Your for loop could be simplified to:
Some additional explanation:
1. A small cleanup loop handling the last few floats is quite common, and when N % 4 != 0 or N is unknown at compile time it is mandatory.
2. I notice that you chose the unaligned versions of load/store; there is a small penalty compared to the aligned versions. I found this link on Stack Overflow: Is the SSE unaligned load intrinsic any slower than the aligned load intrinsic on x64_64 Intel CPUs?