Does anyone know how to vectorize the following code?
uint32_t r[8];
uint16_t* ptr;
for (int j = 0; j < 8; ++j)
if (r[j] < C)
r[j] = *(ptr++);
It's basically a masked gather operation. The auto-vectorizer can't deal with this. If ptr was a uint32_t* it should be directly realizable with _mm256_mask_i32gather_epi32. But even then how do you generate the correct index vector? And wouldn't it be faster to just use a packed load and shuffling the result anyway (requiring a similar index vector)?
Updated answer: The main piece of code has been rewritten as a function and a solution suitable for AMD processors has been added.
As Peter Cordes mentioned in the comments, the AVX-512 instruction
vpexpandd
would be useful here. The functions_mm256_mask_expand_epi32_AVX2_BMI()
and_mm256_mask_expand_epi32_AVX2()
below more or less emulate this instruction. The AVX2_BMI variant is suitable for Intel Haswell processors and newer. The_mm256_mask_expand_epi32_AVX2()
function is suitable for AMD processors with a slow or lackingpdep
instruction, such as the Ryzen processor. In this function a few instructions with high throughput, such as shifts and simple arithmetic operations, are used instead of thepdep
instruction. Another possibility for AMD processors would be to process only 4 elements at the time, and use a tiny (16 element) lookup-table to retrieve the shuf_mask.Below these two functions it is shown how these can be used to vectorize your scalar code
The answer uses a similar idea as in this answer by Peter Cordes, which discusses left packing based on a mask. In that answer the BMI2 instruction
pext
is used to compute the permutation vector. Here we use thepdep
instruction instead, to compute the permutation vector. Function_mm256_mask_expand_epi32_AVX2()
finds the permutation vector in a different way by computing a prefix sum on ther<C
mask.Because of the unsigned
uint32_t
, I used Paul R's idea forepu32
unsigned comparisons.The output is: