SSE Loading & Adding

Posted 2019-06-27 05:06

Question:

Assume I have two vectors represented by two arrays of type double, each of size 2. I'd like to add corresponding positions. So assume vectors i0 and i1, I'd like to add i0[0] + i1[0] and i0[1] + i1[1] together.

Since the type is double, I would need two registers. The trick would be to put i0[0] and i1[0] in one register and i0[1] and i1[1] in another, and just add each register with itself.

My question is, if I call _mm_load_ps(i0[0]) and then _mm_load_ps(i1[0]), will that place them in the lower and upper 64 bits separately, or will the second load replace the contents of the register? How would I place both doubles in the same register, so I can call add_ps afterwards?

Thanks,

Answer 1:

I think what you want is this:

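#include <emmintrin.h>  // SSE2 intrinsics (__m128d, _mm_load_pd, _mm_add_pd)

// Note: _mm_load_pd requires 16-byte-aligned addresses;
// use _mm_loadu_pd if alignment is not guaranteed.
double i0[2];
double i1[2];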

__m128d x1 = _mm_load_pd(i0);
__m128d x2 = _mm_load_pd(i1);
__m128d sum = _mm_add_pd(x1, x2);
// do whatever you want to with "sum" now

When you do a _mm_load_pd, it puts the first double into the lower 64 bits of the register and the second into the upper 64 bits. So, after the loads above, x1 holds the two double values i0[0] and i0[1] (and similarly for x2). The call to _mm_add_pd vertically adds the corresponding elements in x1 and x2, so after the addition, sum holds i0[0] + i1[0] in its lower 64 bits and i0[1] + i1[1] in its upper 64 bits.
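
To make the lane layout concrete, here is a minimal sketch of the same load-and-add with each 64-bit lane read back as a scalar. The helper names (low_lane, high_lane, add_pairs) are hypothetical and only for illustration; the intrinsics themselves are standard SSE2 from <emmintrin.h>. It assumes both input pointers are 16-byte aligned.

```c
#include <emmintrin.h>  /* SSE2: __m128d and the _pd intrinsics */

/* Hypothetical helper: returns the low 64-bit lane of v as a double. */
static double low_lane(__m128d v)
{
    return _mm_cvtsd_f64(v);
}

/* Hypothetical helper: returns the high 64-bit lane of v as a double. */
static double high_lane(__m128d v)
{
    /* Copy the high lane into the low position, then read the low lane. */
    return _mm_cvtsd_f64(_mm_unpackhi_pd(v, v));
}

/* Hypothetical helper: element-wise sum of two pairs of doubles.
   Both pointers must be 16-byte aligned, as _mm_load_pd requires. */
static __m128d add_pairs(const double *a, const double *b)
{
    __m128d x1 = _mm_load_pd(a);   /* low lane = a[0], high lane = a[1] */
    __m128d x2 = _mm_load_pd(b);   /* low lane = b[0], high lane = b[1] */
    return _mm_add_pd(x1, x2);     /* {a[0] + b[0], a[1] + b[1]} */
}
```

With a = {1.0, 2.0} and b = {10.0, 20.0}, low_lane(add_pairs(a, b)) is 11.0 and high_lane(add_pairs(a, b)) is 22.0.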

Edit: I should point out that there is no benefit to using _mm_load_pd instead of _mm_load_ps. As the function names indicate, the pd variant explicitly loads two packed doubles and the ps variant loads four packed single-precision floats. Since both are purely bit-for-bit memory moves and both use the SSE floating-point unit, there is no penalty for using _mm_load_ps to load double data. There is also a benefit to _mm_load_ps: its instruction encoding is one byte shorter than that of _mm_load_pd, so it is more efficient in terms of instruction-cache usage (and potentially instruction decoding; I'm not an expert on all of the intricacies of modern x86 processors). The above code using _mm_load_ps would look like:

double i0[2];
double i1[2];

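// _mm_castps_pd reinterprets __m128 as __m128d without emitting an instruction;
// unlike a C-style cast between vector types, it is accepted by all major compilers.
__m128d x1 = _mm_castps_pd(_mm_load_ps((const float *) i0));
__m128d x2 = _mm_castps_pd(_mm_load_ps((const float *) i1));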
__m128d sum = _mm_add_pd(x1, x2);
// do whatever you want to with "sum" now

The casts generate no instructions; they simply make the compiler reinterpret the SSE register's contents as holding doubles instead of floats, so that the value can be passed to the double-precision arithmetic function _mm_add_pd.
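
As a rough check of the claim that the two loads are interchangeable, the sketch below loads the same pair of doubles both ways and compares the lanes. The helper names (load_pair_via_ps, lanes_equal) are hypothetical; the pointer is assumed to be 16-byte aligned.

```c
#include <emmintrin.h>  /* SSE2 (also pulls in the SSE _ps intrinsics) */

/* Hypothetical helper: loads two doubles with the single-precision load,
   then relabels the register as __m128d. p must be 16-byte aligned. */
static __m128d load_pair_via_ps(const double *p)
{
    return _mm_castps_pd(_mm_load_ps((const float *) p));
}

/* Hypothetical helper: returns 1 if both lanes of a and b compare equal. */
static int lanes_equal(__m128d a, __m128d b)
{
    /* _mm_cmpeq_pd sets a lane to all ones where the values compare equal;
       _mm_movemask_pd collects the two sign bits, so 0x3 means both matched. */
    return _mm_movemask_pd(_mm_cmpeq_pd(a, b)) == 0x3;
}
```

For any aligned pair of ordinary (non-NaN) doubles v, lanes_equal(load_pair_via_ps(v), _mm_load_pd(v)) should return 1, since both loads copy the same 16 bytes into the register.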



Answer 2:

The _ps suffix is an abbreviation for "packed single", meaning it is for use with single-precision floating point, not double-precision.

Instead, you want _mm_load_pd(). This function takes a 16-byte-aligned pointer to the first element of an array of two doubles, and loads them both. So you would use it like so:

__m128d v0 = _mm_load_pd(i0);
__m128d v1 = _mm_load_pd(i1);

v0 = _mm_add_pd(v0, v1);
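
If the 16-byte alignment requirement cannot be guaranteed (for example, when the pointer refers to the middle of a larger array), the unaligned variants _mm_loadu_pd and _mm_storeu_pd can be used instead. The following is only a sketch under that assumption; add_arrays is a hypothetical helper name, not something from the answers above.

```c
#include <stddef.h>     /* size_t */
#include <emmintrin.h>  /* SSE2: _mm_loadu_pd, _mm_storeu_pd, _mm_add_pd */

/* Hypothetical helper: out[i] = a[i] + b[i] for i in [0, n).
   Uses unaligned loads and stores, so the pointers need no
   particular alignment. */
static void add_arrays(const double *a, const double *b, double *out, size_t n)
{
    size_t i = 0;

    /* Process two doubles per iteration. */
    for (; i + 2 <= n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);
        __m128d vb = _mm_loadu_pd(b + i);
        _mm_storeu_pd(out + i, _mm_add_pd(va, vb));
    }

    /* Scalar tail for an odd element count. */
    if (i < n) {
        out[i] = a[i] + b[i];
    }
}
```

The aligned versions remain the natural choice when alignment is known, as in the two-element case above; the unaligned ones trade that requirement for flexibility.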