Let's start by including the following:
#include <vector>
#include <random>
#include <ctime>       // time()
#include <emmintrin.h> // SSE2 intrinsics
using namespace std;
Now, suppose that one has the following three std::vector<float>:
const int N = 1048576;
vector<float> a(N);
vector<float> b(N);
vector<float> c(N);
default_random_engine randomGenerator(time(0));
uniform_real_distribution<float> diceroll(0.0f, 1.0f);
for(int i = 0; i < N; i++)
{
a[i] = diceroll(randomGenerator);
b[i] = diceroll(randomGenerator);
}
Now, assume that one needs to sum a and b element-wise and store the result in c, which in scalar form looks like the following:
for(int i=0; i<N; i++)
{
c[i] = a[i] + b[i];
}
What would be the SSE2 vectorized version of the above code, keeping in mind that the inputs are a and b as defined above (i.e. as collections of float) and the output is c (also a collection of float)?
After quite a bit of research, I was able to come up with the following:
for(int i=0; i<N; i+=4)
{
float a_toload[4] = { a[i], a[i + 1], a[i + 2], a[i + 3] };
float b_toload[4] = { b[i], b[i + 1], b[i + 2], b[i + 3] };
__m128 loaded_a = _mm_loadu_ps(a_toload);
__m128 loaded_b = _mm_loadu_ps(b_toload);
float result[4] = { 0, 0, 0, 0 };
_mm_storeu_ps(result, _mm_add_ps(loaded_a , loaded_b));
c[i] = result[0];
c[i + 1] = result[1];
c[i + 2] = result[2];
c[i + 3] = result[3];
}
However, this seems to be really cumbersome and is certainly quite inefficient: the SIMD version above is actually three times slower than the initial scalar version (measured, of course, with optimizations on, in release mode of Microsoft VS15, and after 1 million iterations, not just 12).
You don't need the intermediate arrays to load into the SSE registers; just load directly from your arrays. You could also omit the two loaded variables and fold those loads into the add, although the compiler should do that for you.

You need to be careful with this, though: it won't work correctly if the vector size is not a multiple of 4 (you'll read past the end of the arrays, resulting in undefined behavior, and the write past the end of c could be damaging).

Your for loop could be simplified to:
Some additional explanation:
1. A small cleanup loop handling the last few floats is quite common, and when N % 4 != 0 or N is unknown at compile time it is mandatory.
2. I notice that you chose the unaligned versions of load/store; there is a small penalty compared to the aligned versions. I found this link on Stack Overflow: Is the SSE unaligned load intrinsic any slower than the aligned load intrinsic on x64_64 Intel CPUs?