How to vectorize a distance calculation using SSE2

A and B are vectors of length N, where N could be in the range 20 to 200, say. I want to calculate the squared distance between these vectors, i.e. d^2 = ||A-B||^2.

So far I have:

float* a = ...;
float* b = ...;
float d2 = 0;

for(int k = 0; k < N; ++k)
{
    float d = a[k] - b[k];
    d2 += d * d;
}

That seems to work fine, except that I have profiled my code and this is the bottleneck (more than 50% of the time is spent just doing this). I am using Visual Studio 2012, on Win 7, with these optimization options: /O2 /Oi /Ot /Oy-. My understanding is that VS2012 should auto-vectorize that loop (using SSE2). However, if I insert #pragma loop(no_vector) in the code I don't get a noticeable slowdown, so I guess the loop is not being vectorized. The compiler confirms that with this message:

  info C5002: loop not vectorized due to reason '1105'

My questions are:

Is it possible to fix this code so that VS2012 can vectorize it?
If not, would it make sense to try to vectorize the code myself?
Can you recommend a web site for me to learn about SSE2 coding?
Is there some value of N below which vectorization would be counter productive?
What is reason '1105'?

Tags: c++ visual-c++ optimization vectorization sse2

2 Answers

虎瘦雄心在

Reply #2 · 2019-02-10 21:34

It's pretty straightforward to implement this using SSE intrinsics (the horizontal sum below uses _mm_hadd_ps, which is SSE3):

#include "pmmintrin.h"

__m128 vd2 = _mm_set1_ps(0.0f);
float d2 = 0.0f;
int k;

// process 4 elements per iteration
for (k = 0; k < N - 3; k += 4)
{
    __m128 va = _mm_loadu_ps(&a[k]);
    __m128 vb = _mm_loadu_ps(&b[k]);
    __m128 vd = _mm_sub_ps(va, vb);
    vd = _mm_mul_ps(vd, vd);
    vd2 = _mm_add_ps(vd2, vd);
}

// horizontal sum of the 4 partial sums of squares
vd2 = _mm_hadd_ps(vd2, vd2);
vd2 = _mm_hadd_ps(vd2, vd2);
_mm_store_ss(&d2, vd2);

// clean up any remaining elements
for ( ; k < N; ++k)
{
    float d = a[k] - b[k];
    d2 += d * d;
}

Note that if you can guarantee that a and b are 16-byte aligned, then you can use _mm_load_ps rather than _mm_loadu_ps, which may help performance, particularly on older (pre-Nehalem) CPUs.
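One way to act on the alignment note is to test both pointers at run time and choose the load instruction accordingly. The following is a minimal sketch, not part of the original answer; the names is_aligned16 and sum_sq_diff4 are hypothetical, and it assumes the element count passed in is a multiple of 4.

```cpp
#include <emmintrin.h>  // SSE2 header (also provides the SSE float intrinsics)
#include <cstdint>

// Hypothetical helper: true if p lies on a 16-byte boundary.
bool is_aligned16(const void* p)
{
    return (reinterpret_cast<std::uintptr_t>(p) & 15u) == 0;
}

// Hypothetical helper: accumulates squared differences over the first n4
// elements (n4 is assumed to be a multiple of 4) and returns the four
// partial sums, one per lane.
__m128 sum_sq_diff4(const float* a, const float* b, int n4)
{
    __m128 acc = _mm_setzero_ps();

    if (is_aligned16(a) && is_aligned16(b))
    {
        // both inputs are 16-byte aligned: aligned loads are safe
        for (int k = 0; k < n4; k += 4)
        {
            __m128 vd = _mm_sub_ps(_mm_load_ps(a + k), _mm_load_ps(b + k));
            acc = _mm_add_ps(acc, _mm_mul_ps(vd, vd));
        }
    }
    else
    {
        // at least one input is misaligned: fall back to unaligned loads
        for (int k = 0; k < n4; k += 4)
        {
            __m128 vd = _mm_sub_ps(_mm_loadu_ps(a + k), _mm_loadu_ps(b + k));
            acc = _mm_add_ps(acc, _mm_mul_ps(vd, vd));
        }
    }
    return acc;
}
```

Whether the aligned path is measurably faster depends on the CPU, so it is worth profiling both before keeping the extra branch.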

Note also that for loops such as this, where there are very few arithmetic instructions relative to the number of loads, performance may well be limited by memory bandwidth, and the expected speed-up from vectorization may not be realised in practice.
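For a target limited to SSE2, as in the question title, the final horizontal sum can be done with shuffles instead of _mm_hadd_ps. The following is a minimal sketch of the same loop wrapped in a function; the name dist2_sse2 is hypothetical and not from the original answer.

```cpp
#include <emmintrin.h>  // SSE2 header (also provides the SSE float intrinsics)

// Hypothetical helper: squared Euclidean distance without any SSE3 instruction.
float dist2_sse2(const float* a, const float* b, int n)
{
    __m128 vd2 = _mm_setzero_ps();
    int k = 0;

    // process 4 elements per iteration
    for (; k + 3 < n; k += 4)
    {
        __m128 va = _mm_loadu_ps(a + k);
        __m128 vb = _mm_loadu_ps(b + k);
        __m128 vd = _mm_sub_ps(va, vb);
        vd2 = _mm_add_ps(vd2, _mm_mul_ps(vd, vd));
    }

    // horizontal sum using shuffles only:
    // [x0 x1 x2 x3] + [x2 x3 x2 x3] leaves x0+x2 in lane 0 and x1+x3 in lane 1
    __m128 hi = _mm_movehl_ps(vd2, vd2);
    __m128 sum = _mm_add_ps(vd2, hi);
    // broadcast lane 1 and add it to lane 0
    __m128 lane1 = _mm_shuffle_ps(sum, sum, _MM_SHUFFLE(1, 1, 1, 1));
    sum = _mm_add_ss(sum, lane1);
    float d2 = _mm_cvtss_f32(sum);

    // clean up any remaining elements
    for (; k < n; ++k)
    {
        float d = a[k] - b[k];
        d2 += d * d;
    }
    return d2;
}
```

Because the four lanes are accumulated separately, the result can differ from the scalar loop in the last few bits for general inputs.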


We Are One

Reply #3 · 2019-02-10 21:39

According to the MSDN documentation, reason code 1105 means the loop contains a reduction operation that the compiler does not recognize, so it cannot turn it into vectorized instructions. For floating-point reductions such as the sum here, the documentation indicates that you need to specify the /fp:fast option for them to be vectorized at all.
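To try this, the question's loop can be left unchanged and only the build options adjusted. The sketch below wraps the loop in a function (the name dist2_scalar and the file name in the comment are hypothetical); whether the loop then vectorizes depends on the compiler version and the rest of the command line, so the vectorizer report is worth checking.

```cpp
// Assumed MSVC command line, reporting vectorized and non-vectorized loops:
//   cl /O2 /fp:fast /Qvec-report:2 dist2.cpp

// Hypothetical wrapper around the loop from the question.
float dist2_scalar(const float* a, const float* b, int n)
{
    float d2 = 0.0f;
    for (int k = 0; k < n; ++k)
    {
        float d = a[k] - b[k];
        // floating-point reduction: vectorizing it changes the order of the
        // additions, which the compiler may only do under /fp:fast
        d2 += d * d;
    }
    return d2;
}
```

Since /fp:fast relaxes floating-point semantics for the whole translation unit, it may be safer to apply it only to the file containing this function.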
