I have an aligned array of integers in memory containing indices I0, I1, I2, I3. My goal is to get them into a __m256i register containing I0, I0 + 1, I1, I1 + 1, I2, I2 + 1, I3, I3 + 1. The hard part is getting them into the 256 bit register as I0, I0, I1, I1, I2, I2, I3, I3, after which I can just add a register containing 0, 1, 0, 1, 0, 1, 0, 1.
I found the intrinsic, _mm256_castsi128_si256, which lets me load the 4 integers into the lower 128 bits of the 256 bit register, but I'm struggling to find the best intrinsics to use from there.
Any help would be appreciated. I have access to all SSE versions, AVX, and AVX2 and would like to do this using intrinsics only.
Edit:
I think this works, but I'm not sure how efficient it is... I'm in the process of testing it.
// _mm_load_si128: Loads 4 integer values into a temporary 128-bit register.
// _mm256_broadcastsi128_si256: Copies 4 integer values in the 128 bit register to the low and high 128 bits of the 256 bit register.
__m256i tmpStuff = _mm256_broadcastsi128_si256 ((_mm_load_si128((__m128i*) indicesArray)));
// _mm256_unpacklo_epi32: Interleaves the integer values of source0 and source1.
__m256i indices = _mm256_unpacklo_epi32(tmpStuff, tmpStuff);
__m256i regToAdd = _mm256_set_epi32 (0, 1, 0, 1, 0, 1, 0, 1);
indices = _mm256_add_epi32(indices, regToAdd);
Edit2: The above code does not work because _mm256_unpacklo_epi32 does not behave the way I thought: it interleaves within each 128-bit lane separately. The code above will result in I0, I0+1, I1, I1+1, I0, I0+1, I1, I1+1.
Edit3: The following code works, though again I'm not sure if it's the most efficient:
__m256i tmpStuff = _mm256_castsi128_si256(_mm_loadu_si128((__m128i*) indicesArray));
__m256i mask = _mm256_set_epi32 (3, 3, 2, 2, 1, 1, 0, 0);
__m256i indices = _mm256_permutevar8x32_epi32(tmpStuff, mask);
__m256i regToAdd = _mm256_set_epi32 (1, 0, 1, 0, 1, 0, 1, 0); // Set in reverse order.
indices = _mm256_add_epi32(indices, regToAdd);
Your _mm256_permutevar8x32_epi32 version looks ideal for Intel CPUs, unless I'm missing a way that could fold the shuffle into a 128b load. That could help slightly for fused-domain uop throughput, but not for unfused-domain.

1 load (vmovdqa), 1 shuffle (vpermd, aka _mm256_permutevar8x32_epi32) and 1 add (vpaddd) is pretty light-weight. On Intel, lane-crossing shuffles have extra latency but no worse throughput. On AMD Ryzen, lane-crossing shuffles are more expensive (http://agner.org/optimize/). Since you can use AVX2, your solution is great if loading a shuffle mask for vpermd isn't a problem (register pressure / cache misses).

Beware that _mm256_castsi128_si256 doesn't guarantee the high half of the __m256i is all zero. But you don't depend on this, so your code is totally fine.

BTW, you could use one 256-bit load and unpack it 2 different ways with vpermd. Use another mask with all elements 4 higher.
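For example (just a sketch, assuming indicesArray is 32-byte aligned and has at least 8 valid elements; the variable names are mine):
// One 32-byte load covering I0..I7, then two vpermd shuffles with different masks.
__m256i eightIndices = _mm256_load_si256((const __m256i*) indicesArray); // I0 I1 I2 I3 I4 I5 I6 I7
__m256i maskLo = _mm256_set_epi32 (3, 3, 2, 2, 1, 1, 0, 0);
__m256i maskHi = _mm256_set_epi32 (7, 7, 6, 6, 5, 5, 4, 4); // same mask, all elements 4 higher
__m256i dupLo = _mm256_permutevar8x32_epi32(eightIndices, maskLo); // I0 I0 I1 I1 I2 I2 I3 I3
__m256i dupHi = _mm256_permutevar8x32_epi32(eightIndices, maskHi); // I4 I4 I5 I5 I6 I6 I7 I7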
Another option is an unaligned 256b load with the lane-split in the middle of your 4 elements, so you have 2 elements at the bottom of the high lane and 2 at the top of the low lane. Then you can use an in-lane shuffle to put your data where it's needed. But it's a different shuffle in each lane, so you will still need a shuffle that takes the control operand in a register (not an immediate) to do it in a single operation. (vpshufd and vpermilps imm8 recycle the same immediate for both lanes.) The only shuffles where different bits of the immediate affect the upper / lower lane separately are qword-granularity shuffles like vpermq (_mm256_permutex_epi64, not permutexvar).

You could use vpermilps ymm,ymm,ymm or vpshufb (_mm256_shuffle_epi8) for this, which will run more efficiently on Ryzen than a lane-crossing vpermd (probably 3 uops / 1 per 4c throughput if it's the same as vpermps, according to Agner Fog).
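For example (only a sketch: it assumes the 8 bytes before and after your 4-element array are safe to read, e.g. because indicesArray points into a larger buffer):
// Unaligned 32-byte load starting 8 bytes early: x x I0 I1 | I2 I3 x x
__m256i v = _mm256_loadu_si256((const __m256i*) (indicesArray - 2));
// In-lane variable control for vpermilps ymm,ymm,ymm: low lane picks 2,2,3,3; high lane picks 0,0,1,1
__m256i ctrl = _mm256_setr_epi32(2, 2, 3, 3, 0, 0, 1, 1);
__m256i dup = _mm256_castps_si256(_mm256_permutevar_ps(_mm256_castsi256_ps(v), ctrl)); // I0 I0 I1 I1 | I2 I2 I3 I3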
But using an unaligned load is not appealing when your data is already aligned, and all it gains is an in-lane vs. lane-crossing shuffle. If you'd needed a 16 or 8-bit granularity shuffle, it would probably be worth it (because there is no lane-crossing byte or word shuffle until AVX512, and on Skylake-AVX512 vpermw is multiple uops.)

An alternative that avoids a shuffle-mask vector constant, but is worse for performance (because it takes twice as many shuffles):
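For example, something like this (a sketch; it assumes indicesArray is 16-byte aligned):
__m128i four = _mm_load_si128((const __m128i*) indicesArray); // I0 I1 I2 I3
__m256i zx = _mm256_cvtepu32_epi64(four); // vpmovzxdq: I0 0 I1 0 | I2 0 I3 0
__m256i dup = _mm256_shuffle_epi32(zx, _MM_SHUFFLE(2, 2, 0, 0)); // vpshufd: I0 I0 I1 I1 | I2 I2 I3 I3
// then add regToAdd as before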
vpmovzxdq is another option for getting the upper two elements into the upper 128-bit lane.

Or, possibly higher throughput than the 2-shuffle version above if the shuffle port is a bottleneck for the whole loop. (Still worse than the vpermd version, though.) This version has some instruction-level parallelism: the OR can run in parallel with the shift. But it still sucks for being more uops; if you're out of vector regs, it's probably still best to use a shuffle-control vector from memory.
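A sketch of that single-shuffle shift/OR variant (folding your 0, 1, 0, 1, ... add-vector into the OR is my assumption; it keeps the OR independent of the shift):
__m128i four = _mm_load_si128((const __m128i*) indicesArray); // I0 I1 I2 I3
__m256i zx = _mm256_cvtepu32_epi64(four); // vpmovzxdq: I0 0 I1 0 | I2 0 I3 0
__m256i shifted = _mm256_slli_epi64(zx, 32); // vpsllq: 0 I0 0 I1 | 0 I2 0 I3
__m256i ored = _mm256_or_si256(zx, _mm256_setr_epi32(0, 1, 0, 1, 0, 1, 0, 1)); // vpor: I0 1 I1 1 | I2 1 I3 1, independent of the shift
__m256i indices = _mm256_add_epi32(ored, shifted); // vpaddd: I0 I0+1 I1 I1+1 | I2 I2+1 I3 I3+1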