AVX2, How to Efficiently Load Four Integers to Even Indices of a 256 Bit Register and Copy to Odd Indices?

Published 2019-04-07 00:57

Question:

I have an aligned array of integers in memory containing indices I0, I1, I2, I3. My goal is to get them into a __m256i register containing I0, I0 + 1, I1, I1 + 1, I2, I2 + 1, I3, I3 + 1. The hard part is getting them into the 256 bit register as I0, I0, I1, I1, I2, I2, I3, I3, after which I can just add a register containing 0, 1, 0, 1, 0, 1, 0, 1.

I found the intrinsic, _mm256_castsi128_si256, which lets me load the 4 integers into the lower 128 bits of the 256 bit register, but I'm struggling to find the best intrinsics to use from there.

Any help would be appreciated. I have access to all SSE versions, AVX, and AVX2 and would like to do this using intrinsics only.

Edit:

I think this works, but I'm not sure how efficient it is... I'm in the process of testing it.

// _mm_load_si128: Loads 4 integer values into a temporary 128-bit register.
// _mm256_broadcastsi128_si256: Copies 4 integer values in the 128 bit register to the low and high 128 bits of the 256 bit register.
__m256i tmpStuff = _mm256_broadcastsi128_si256 ((_mm_load_si128((__m128i*) indicesArray)));

// _mm256_unpacklo_epi32: Interleaves the integer values of source0 and source1.
__m256i indices = _mm256_unpacklo_epi32(tmpStuff, tmpStuff);

__m256i regToAdd = _mm256_set_epi32 (0, 1, 0, 1, 0, 1, 0, 1);
indices = _mm256_add_epi32(indices, regToAdd);

Edit2: The above code does not work because _mm256_unpacklo_epi32 does not behave the way I thought: it interleaves within each 128-bit lane independently. The code above will result in I0, I0+1, I1, I1+1, I0, I0+1, I1, I1+1.
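For reference, a minimal sketch of what that interleave actually produces, assuming tmpStuff holds I0..I3 broadcast to both 128-bit lanes as above:

// tmpStuff (memory order):    I0 I1 I2 I3 | I0 I1 I2 I3
// _mm256_unpacklo_epi32 interleaves the low halves of each 128-bit lane independently:
__m256i interleaved = _mm256_unpacklo_epi32(tmpStuff, tmpStuff);
// interleaved (memory order): I0 I0 I1 I1 | I0 I0 I1 I1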

Edit3: The following code works, though again I'm not sure if it's the most efficient:

__m256i tmpStuff = _mm256_castsi128_si256(_mm_loadu_si128((__m128i*) indicesArray));
__m256i mask = _mm256_set_epi32 (3, 3, 2, 2, 1, 1, 0, 0);
__m256i indices = _mm256_permutevar8x32_epi32(tmpStuff, mask);
__m256i regToAdd = _mm256_set_epi32 (1, 0, 1, 0, 1, 0, 1, 0); // Set in reverse order.
indices = _mm256_add_epi32(indices, regToAdd);

Answer 1:

Your _mm256_permutevar8x32_epi32 version looks ideal for Intel CPUs, unless I'm missing a way that could fold the shuffle into a 128b load. That could help slightly for fused-domain uop throughput, but not for unfused-domain.

1 load (vmovdqa), 1 shuffle (vpermd, aka _mm256_permutevar8x32_epi32) and 1 add (vpaddd) is pretty light-weight. On Intel, lane-crossing shuffles have extra latency but no worse throughput. On AMD Ryzen, lane-crossing shuffles are more expensive. (http://agner.org/optimize/).

Since you can use AVX2, your solution is great if loading a shuffle mask for vpermd isn't a problem (register pressure / cache misses).

Beware that _mm256_castsi128_si256 doesn't guarantee the high half of the __m256i is all zero. But you don't depend on this, so your code is totally fine.


BTW, you could use one 256-bit load and unpack it 2 different ways with vpermd. Use another mask with all elements 4 higher.
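A minimal sketch of that idea, assuming you process 8 indices per iteration and indicesArray is 32-byte aligned with at least 8 elements (names here are just for illustration):

__m256i eight  = _mm256_load_si256((const __m256i*) indicesArray);            // I0..I7
__m256i loMask = _mm256_set_epi32(3, 3, 2, 2, 1, 1, 0, 0);
__m256i hiMask = _mm256_set_epi32(7, 7, 6, 6, 5, 5, 4, 4);                    // same pattern, all elements 4 higher
__m256i ones   = _mm256_set_epi32(1, 0, 1, 0, 1, 0, 1, 0);                    // 0,1,0,1,... in memory order
__m256i outLo  = _mm256_add_epi32(_mm256_permutevar8x32_epi32(eight, loMask), ones); // I0,I0+1 .. I3,I3+1
__m256i outHi  = _mm256_add_epi32(_mm256_permutevar8x32_epi32(eight, hiMask), ones); // I4,I4+1 .. I7,I7+1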


Another option is an unaligned 256-bit load with the lane split in the middle of your 4 elements, so you have 2 elements at the bottom of the high lane and 2 at the top of the low lane. Then you can use an in-lane shuffle to put your data where it's needed. But it's a different shuffle in each lane, so you still need a shuffle that takes its control operand in a register (not an immediate) to do it in a single operation. (vpshufd and vpermilps imm8 recycle the same immediate for both lanes.) The only shuffles where different bits of the immediate affect the upper / lower lanes separately are qword-granularity shuffles like vpermq (_mm256_permute4x64_epi64, the immediate form, not the variable-control permutexvar).

You could use vpermilps ymm,ymm,ymm or vpshufb (_mm256_shuffle_epi8) for this, which will run more efficiently on Ryzen than a lane-crossing vpermd (probably 3 uops / one per 4c throughput, if it's the same as vpermps, according to Agner Fog's tables).

But using an unaligned load is not appealing when your data is already aligned, and all it gains is trading a lane-crossing shuffle for an in-lane one. If you needed a 16- or 8-bit granularity shuffle, it would probably be worth it (because there is no lane-crossing byte or word shuffle until AVX512, and on Skylake-AVX512 vpermw is multiple uops).
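For completeness, a hedged sketch of that unaligned-load idea. It assumes the 8 bytes immediately before and after the 4 indices are readable (e.g. they sit inside a larger array), because the load deliberately straddles them; vpermilps with a variable control is _mm256_permutevar_ps, and the variable names are just for illustration:

__m256i straddle = _mm256_loadu_si256((const __m256i*)(indicesArray - 2));    // x,x,I0,I1 | I2,I3,x,x
__m256i ctrl     = _mm256_set_epi32(1, 1, 0, 0, 3, 3, 2, 2);                  // per-lane dword selectors
__m256  shuf     = _mm256_permutevar_ps(_mm256_castsi256_ps(straddle), ctrl); // I0,I0,I1,I1 | I2,I2,I3,I3
__m256i result   = _mm256_add_epi32(_mm256_castps_si256(shuf),
                                    _mm256_set_epi32(1, 0, 1, 0, 1, 0, 1, 0));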


An alternative that avoids a shuffle-mask vector constant, but performs worse (because it takes twice as many shuffles):

vpmovzxdq is another option for getting the upper two elements into the upper 128-bit lane.

; slow, not recommended.  Avoids using a register for shuffle-control, though.
vpmovzxdq  ymm0, [src]
vpshufd    ymm1, ymm0, _MM_SHUFFLE(2,2, 0,0)   ; duplicate elements
vpaddd     ...
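A rough intrinsics equivalent of that sequence, assuming the aligned indicesArray from the question (sketch only, names are illustrative):

__m256i zx  = _mm256_cvtepu32_epi64(_mm_load_si128((const __m128i*) indicesArray)); // vpmovzxdq: I0,0,I1,0 | I2,0,I3,0
__m256i dup = _mm256_shuffle_epi32(zx, _MM_SHUFFLE(2, 2, 0, 0));                    // vpshufd:   I0,I0,I1,I1 | I2,I2,I3,I3
__m256i res = _mm256_add_epi32(dup, _mm256_set_epi32(1, 0, 1, 0, 1, 0, 1, 0));      // vpaddd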

Or this, which possibly has higher throughput than the 2-shuffle version above if the shuffle port is a bottleneck for the whole loop. (Still worse than the vpermd version, though.)

; slow, not recommended.
vpmovzxdq  ymm0, [src]
vpsllq     ymm1, ymm0,32          ; left shift by 32
vpor       ymm0, ymm0, odd_ones   ; OR with set1_epi64x(1ULL << 32)
vpaddd     ymm0, ymm0, ymm1       ; I_n+0 in even elements, 1+I_n in odd

This has some instruction-level parallelism: the OR can run in parallel with the shift. But it still sucks for being more uops; even if you're out of vector regs, it's probably still best to use a shuffle-control vector from memory.
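And a rough intrinsics equivalent of the shift/OR version, under the same assumptions (sketch only):

__m256i zx      = _mm256_cvtepu32_epi64(_mm_load_si128((const __m128i*) indicesArray)); // vpmovzxdq
__m256i shifted = _mm256_slli_epi64(zx, 32);                                            // vpsllq: 0,I0,0,I1 | 0,I2,0,I3
__m256i odd1    = _mm256_or_si256(zx, _mm256_set1_epi64x(1ULL << 32));                  // vpor:   I0,1,I1,1 | I2,1,I3,1
__m256i res     = _mm256_add_epi32(odd1, shifted);                                      // vpaddd: I0,I0+1,I1,I1+1 | I2,I2+1,I3,I3+1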



Tags: x86 sse simd avx avx2