Question:
I have very long byte arrays that need to be added to a destination array of type short (or int).
Does such an SSE instruction exist? Or maybe a set of instructions?
Answer 1:
You need to unpack each vector of 8 bit values to two vectors of 16 bit values and then add those.
__m128i v = _mm_set_epi8(15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0);
__m128i vl = _mm_unpacklo_epi8(v, _mm_set1_epi8(0)); // vl = { 7, 6, 5, 4, 3, 2, 1, 0 }
__m128i vh = _mm_unpackhi_epi8(v, _mm_set1_epi8(0)); // vh = { 15, 14, 13, 12, 11, 10, 9, 8 }
where v is a vector of 16 x 8 bit values and vl, vh are the two unpacked vectors of 8 x 16 bit values.
Note that I'm assuming that the 8 bit values are unsigned so when unpacking to 16 bits the high byte is set to 0 (i.e. no sign extension).
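To connect the unpack step back to the original question, here is a minimal sketch of adding an unsigned byte array into a 16 bit destination array. The function name add_u8_to_u16 and its parameters are made up for illustration. It assumes unsigned bytes (as above), uses unaligned loads and stores so the buffers need no particular alignment, and handles leftover elements with a scalar loop.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: dst[i] += src[i] for i in [0, n),
   with each unsigned 8 bit source value zero-extended to 16 bits. */
static void add_u8_to_u16(uint16_t *dst, const uint8_t *src, size_t n)
{
    const __m128i zero = _mm_setzero_si128();
    size_t i = 0;

    for (; i + 16 <= n; i += 16)
    {
        /* load 16 bytes and widen them to two vectors of 8 x 16 bit */
        __m128i v  = _mm_loadu_si128((const __m128i *)(src + i));
        __m128i vl = _mm_unpacklo_epi8(v, zero);  /* src[i+0 .. i+7]  */
        __m128i vh = _mm_unpackhi_epi8(v, zero);  /* src[i+8 .. i+15] */

        /* load the matching 16 destination values, add, store back */
        __m128i dl = _mm_loadu_si128((const __m128i *)(dst + i));
        __m128i dh = _mm_loadu_si128((const __m128i *)(dst + i + 8));
        _mm_storeu_si128((__m128i *)(dst + i),     _mm_add_epi16(dl, vl));
        _mm_storeu_si128((__m128i *)(dst + i + 8), _mm_add_epi16(dh, vh));
    }

    /* scalar tail for the last n % 16 elements */
    for (; i < n; ++i)
        dst[i] = (uint16_t)(dst[i] + src[i]);
}
```

Note that _mm_add_epi16 wraps around on overflow, just like the scalar tail does. If saturation is wanted instead, _mm_adds_epu16 could be substituted in the vector loop, with the tail changed to match.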
If you want to sum a lot of these vectors and get a 32 bit result then a useful trick is to use _mm_madd_epi16 with a multiplier of 1, e.g.
__m128i vsuml = _mm_set1_epi32(0);
__m128i vsumh = _mm_set1_epi32(0);
__m128i vsum;
int sum;
// assumes x is a 16-byte aligned array of unsigned 8 bit values and N is a multiple of 16
for (int i = 0; i < N; i += 16)
{
    __m128i v = _mm_load_si128((const __m128i *)&x[i]);
    __m128i vl = _mm_unpacklo_epi8(v, _mm_set1_epi8(0));
    __m128i vh = _mm_unpackhi_epi8(v, _mm_set1_epi8(0));
    vsuml = _mm_add_epi32(vsuml, _mm_madd_epi16(vl, _mm_set1_epi16(1)));
    vsumh = _mm_add_epi32(vsumh, _mm_madd_epi16(vh, _mm_set1_epi16(1)));
}
// do horizontal sum of 4 partial sums and store in scalar int
vsum = _mm_add_epi32(vsuml, vsumh);
vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 8));
vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 4));
sum = _mm_cvtsi128_si32(vsum);
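For reference, the same summation can be wrapped into a self-contained function. This is only a sketch: the name sum_u8 is hypothetical, it uses unaligned loads so the input need not be 16-byte aligned, and it adds a scalar tail so the length need not be a multiple of 16. The 32 bit accumulators wrap once the total exceeds what 32 bits can hold, so for very long arrays the partial sums would have to be flushed to a wider type periodically.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns the sum of n unsigned bytes (modulo 2^32). */
static uint32_t sum_u8(const uint8_t *x, size_t n)
{
    const __m128i zero = _mm_setzero_si128();
    const __m128i ones = _mm_set1_epi16(1);
    __m128i vsuml = zero;
    __m128i vsumh = zero;
    size_t i = 0;

    for (; i + 16 <= n; i += 16)
    {
        __m128i v  = _mm_loadu_si128((const __m128i *)(x + i));
        __m128i vl = _mm_unpacklo_epi8(v, zero);
        __m128i vh = _mm_unpackhi_epi8(v, zero);
        /* madd with 1 adds adjacent 16 bit pairs into 32 bit lanes */
        vsuml = _mm_add_epi32(vsuml, _mm_madd_epi16(vl, ones));
        vsumh = _mm_add_epi32(vsumh, _mm_madd_epi16(vh, ones));
    }

    /* horizontal sum of the 4 x 32 bit partial sums */
    __m128i vsum = _mm_add_epi32(vsuml, vsumh);
    vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 8));
    vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 4));
    uint32_t sum = (uint32_t)_mm_cvtsi128_si32(vsum);

    /* scalar tail for the last n % 16 bytes */
    for (; i < n; ++i)
        sum += x[i];

    return sum;
}
```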
Answer 2:
If you need to sign-extend your byte vectors instead of zero-extend, use pmovsxbw (_mm_cvtepi8_epi16), which requires SSE4.1. Unlike the unpack hi/lo instructions, you can only pmovsx from the low half/quarter/eighth of a src register.
You can pmovsx directly from memory though, even though intrinsics make this really clumsy. Since shuffle throughput is more limited than load throughput on most CPUs, it's probably preferable to do two load+pmovsx than to do one load + three shuffles.
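As a rough sketch of the two load+pmovsx approach with intrinsics, the following adds a signed byte array into a signed 16 bit destination. The function name add_i8_to_i16 and the SSE41_FN macro are made up for this example. Each 8 byte half is loaded with _mm_loadl_epi64 and widened with _mm_cvtepi8_epi16. Whether the compiler folds each load into the pmovsxbw memory operand is up to the compiler, so that part is an assumption rather than a guarantee. The target attribute is a GCC/Clang-specific way of enabling SSE4.1 for one function; building with -msse4.1 would work as well.

```c
#include <smmintrin.h>  /* SSE4.1: _mm_cvtepi8_epi16 */
#include <stddef.h>
#include <stdint.h>

/* Assumption: GCC or Clang. Enables SSE4.1 for this one function
   even if the file is not compiled with -msse4.1. */
#if defined(__GNUC__)
#define SSE41_FN __attribute__((target("sse4.1")))
#else
#define SSE41_FN
#endif

/* Hypothetical helper: dst[i] += src[i] for i in [0, n),
   with each signed 8 bit source value sign-extended to 16 bits. */
static SSE41_FN void add_i8_to_i16(int16_t *dst, const int8_t *src, size_t n)
{
    size_t i = 0;

    for (; i + 16 <= n; i += 16)
    {
        /* two 8 byte loads, each sign-extended to 8 x 16 bit */
        __m128i vl = _mm_cvtepi8_epi16(_mm_loadl_epi64((const __m128i *)(src + i)));
        __m128i vh = _mm_cvtepi8_epi16(_mm_loadl_epi64((const __m128i *)(src + i + 8)));

        __m128i dl = _mm_loadu_si128((const __m128i *)(dst + i));
        __m128i dh = _mm_loadu_si128((const __m128i *)(dst + i + 8));
        _mm_storeu_si128((__m128i *)(dst + i),     _mm_add_epi16(dl, vl));
        _mm_storeu_si128((__m128i *)(dst + i + 8), _mm_add_epi16(dh, vh));
    }

    /* scalar tail for the last n % 16 elements */
    for (; i < n; ++i)
        dst[i] = (int16_t)(dst[i] + src[i]);
}
```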