Assume I have two vectors represented by two arrays of type double
, each of size 2. I'd like to add corresponding positions. So assume vectors i0
and i1
, I'd like to add i0[0] + i1[0]
and i0[1] + i1[1]
together.
Since the type is double
, I would need two registers. The trick would be to put i0[0]
and i1[0]
, and i0[1]
and i1[1]
in another and just add the register with itself.
My question is, if I call _mm_load_ps(i0[0])
and then _mm_load_ps(i1[0])
, will that place them in the lower and upper 64-bits separately, or will it replace the register with the second load
? How would I place both doubles in the same register, so I can call add_ps
after?
I think what you want is this:
#include <emmintrin.h>  // SSE2 intrinsics
double i0[2];  // both arrays must be 16-byte aligned for _mm_load_pd
double i1[2];
__m128d x1 = _mm_load_pd(i0);
__m128d x2 = _mm_load_pd(i1);
__m128d sum = _mm_add_pd(x1, x2);
// do whatever you want to with "sum" now
When you do a _mm_load_pd
, it puts the first double into the lower 64 bits of the register and the second into the upper 64 bits. So, after the loads above, x1
holds the two double
values i0[0]
and i0[1]
(and similar for x2
). The call to _mm_add_pd
vertically adds the corresponding elements in x1
and x2
, so after the addition, sum
holds i0[0] + i1[0]
in its lower 64 bits and i0[1] + i1[1]
in its upper 64 bits.
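To get the result back out of the register, the sum can be stored to memory. The following is a minimal sketch, assuming SSE2 is available and all three pointers are 16-byte aligned; the helper name `add2_aligned` is made up for illustration and does not come from any library.

```c
#include <emmintrin.h>  /* SSE2: __m128d, _mm_load_pd, _mm_add_pd, _mm_store_pd */

/* Hypothetical helper: out[k] = a[k] + b[k] for k = 0, 1.
 * Assumption: a, b and out are all 16-byte aligned. */
static void add2_aligned(const double *a, const double *b, double *out)
{
    __m128d x1 = _mm_load_pd(a);       /* low 64 bits = a[0], high 64 bits = a[1] */
    __m128d x2 = _mm_load_pd(b);       /* low 64 bits = b[0], high 64 bits = b[1] */
    __m128d sum = _mm_add_pd(x1, x2);  /* element-wise (vertical) add */
    _mm_store_pd(out, sum);            /* out[0] = low half, out[1] = high half */
}
```

A caller would declare its arrays with an alignment specifier (for example `_Alignas(16) double i0[2];` in C11) before passing them in.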
Edit: I should point out that there is no benefit to using _mm_load_pd
instead of _mm_load_ps
. As the function names indicate, the pd
variety explicitly loads two packed doubles and the ps
version loads four packed single-precision floats. Since these are purely bit-for-bit memory moves and they both use the SSE floating-point unit, there is no penalty to using _mm_load_ps
to load in double
data. And, there is a benefit to _mm_load_ps
: its instruction encoding is one byte shorter than _mm_load_pd
, so it is more efficient from an instruction cache sense (and potentially instruction decoding; I'm not an expert on all of the intricacies of modern x86 processors). The above code using _mm_load_ps
would look like:
double i0[2];
double i1[2];
__m128d x1 = _mm_castps_pd(_mm_load_ps((const float *) i0));
__m128d x2 = _mm_castps_pd(_mm_load_ps((const float *) i1));
__m128d sum = _mm_add_pd(x1, x2);
// do whatever you want to with "sum" now
No function call is implied by the casts; they simply make the compiler reinterpret the SSE register's contents as holding doubles instead of floats so that it can be passed into the double-precision arithmetic function _mm_add_pd
.
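For a version that can be compiled on its own, here is a minimal sketch that wraps the `_mm_load_ps` approach in functions. It uses the `_mm_castps_pd` intrinsic for the reinterpretation, which generates no instructions; a plain `(__m128d)` cast between vector types is a compiler extension and may not be accepted by every compiler. The helper names `load2_via_ps` and `add2_via_ps` are hypothetical, and the pointers are assumed to be 16-byte aligned.

```c
#include <emmintrin.h>  /* SSE2; also pulls in the SSE header that declares _mm_load_ps */

/* Hypothetical helper: load two doubles with the single-precision load,
 * then reinterpret the 128 bits as two packed doubles.
 * Assumption: p is 16-byte aligned. */
static __m128d load2_via_ps(const double *p)
{
    __m128 raw = _mm_load_ps((const float *)p);  /* bit-for-bit 16-byte load */
    return _mm_castps_pd(raw);                   /* type change only, no conversion */
}

/* Hypothetical helper: out[k] = a[k] + b[k] for k = 0, 1. */
static void add2_via_ps(const double *a, const double *b, double *out)
{
    __m128d sum = _mm_add_pd(load2_via_ps(a), load2_via_ps(b));
    _mm_store_pd(out, sum);
}
```

The result should match what the `_mm_load_pd` version produces, since only the load instruction differs.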
The _ps
suffix is an abbreviation for "packed single", meaning it is for use with single-precision floating point, not double-precision.
Instead, you want _mm_load_pd()
. This function takes a 16-byte aligned pointer to the first member of an array of two double
s, and loads them both. So you would use this like so:
__m128d v0 = _mm_load_pd(i0);
__m128d v1 = _mm_load_pd(i1);
v0 = _mm_add_pd(v0, v1);
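If 16-byte alignment cannot be guaranteed, the unaligned variants `_mm_loadu_pd` and `_mm_storeu_pd` accept any address. This is a sketch of that alternative rather than something the answer above requires; the helper name `add2_unaligned` is made up for illustration.

```c
#include <emmintrin.h>  /* SSE2: _mm_loadu_pd, _mm_add_pd, _mm_storeu_pd */

/* Hypothetical helper: out[k] = a[k] + b[k] for k = 0, 1.
 * No alignment requirement on any of the pointers. */
static void add2_unaligned(const double *a, const double *b, double *out)
{
    __m128d v0 = _mm_loadu_pd(a);  /* unaligned load of a[0], a[1] */
    __m128d v1 = _mm_loadu_pd(b);  /* unaligned load of b[0], b[1] */
    v0 = _mm_add_pd(v0, v1);       /* element-wise add */
    _mm_storeu_pd(out, v0);        /* unaligned store of both sums */
}
```

This can be handy when the doubles sit at an arbitrary offset inside a larger buffer, for example `add2_unaligned(buf + 1, other + 1, result)`.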