Converting from Source-based Indices to Destination-based Indices

I'm using AVX2 instructions in some C code.

The VPERMD instruction takes two 8-integer vectors, a and idx, and generates a third one, dst, by permuting a according to idx. This seems equivalent to dst[i] = a[idx[i]] for i in 0..7. I'm calling this source-based, because each index names the source element to read.

However, my calculated indices are in destination-based form. This is the natural form for setting an array, and is equivalent to dst[idx[i]] = a[i] for i in 0..7.

How can I convert from source-based form to destination-based form? An example test case is:

{2 1 0 5 3 4 6 7}    source-based form
{2 1 0 4 5 3 6 7}    destination-based equivalent
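To make the two forms concrete, here is a minimal scalar sketch in C. The function names (permute_src, permute_dst, invert_permutation) are made up for illustration, and it assumes idx is a valid permutation of 0..n-1. The two index forms are inverse permutations of each other, so a single inversion routine converts in either direction.

```c
#include <stddef.h>
#include <stdint.h>

/* Source-based permute (the VPERMD convention): dst[i] = a[idx[i]]. */
void permute_src(const int32_t *a, const uint32_t *idx, int32_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[idx[i]];
}

/* Destination-based permute: dst[idx[i]] = a[i]. */
void permute_dst(const int32_t *a, const uint32_t *idx, int32_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[idx[i]] = a[i];
}

/* Invert a permutation: out[in[i]] = i.
   Assumes every in[i] is in 0..n-1 and appears exactly once. */
void invert_permutation(const uint32_t *in, uint32_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[in[i]] = (uint32_t)i;
}
```

Running invert_permutation on the first row of the test case yields the second row, and permute_src with one row produces the same result as permute_dst with the other.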

For this conversion, I'm staying in ymm registers, which means that destination-based solutions don't work. Even inserting each element separately doesn't help: the insert instructions only take constant (immediate) indices, so you can't write an element to a position computed at run time.

I guess you're implicitly saying that you can't modify your code to calculate source-based indices in the first place? I can't think of anything you can do with x86 SIMD, other than AVX512 scatter instructions (e.g. vpscatterdd), which do take dst-based indices but write to memory.

Storing to memory, inverting, and reloading a vector might actually be best. (Or transferring to integer registers directly, not through memory, maybe after a vextracti128 / packusdw so you only need two 64-bit transfers from vector to integer regs: movq and pextrq).

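Either way, once you have the indices in scalar form, use each one as the position at which to store a counter (0..7) into an array in memory, and reload that array as a vector. This is still slow and ugly, and the reload includes a store-forwarding failure delay, because a 32-byte load can't be forwarded from several narrower stores. So it's probably worth your while to change your index-generating code to generate source-based shuffle vectors.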
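A rough sketch of the store / invert / reload approach, using the memory round trip rather than the packusdw variant. The function names invert_perm_epi32 and permute_by_dst_indices are hypothetical, and the `__attribute__((target("avx2")))` annotation is a GCC/Clang extension used here only so the file builds without -mavx2; with that flag you could drop it. It assumes the input really is a permutation of 0..7.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: invert an 8-element permutation held in a ymm
   register by bouncing through small stack buffers, so that
   result[dst_idx[i]] = i. */
__attribute__((target("avx2")))
static __m256i invert_perm_epi32(__m256i dst_idx)
{
    uint32_t in[8];
    uint32_t out[8] = {0};

    _mm256_storeu_si256((__m256i *)in, dst_idx);
    for (uint32_t i = 0; i < 8; i++)
        out[in[i] & 7] = i;          /* & 7 keeps a bad index in bounds */

    /* This wide reload of eight narrow stores is where the
       store-forwarding stall comes from. */
    return _mm256_loadu_si256((const __m256i *)out);
}

/* Hypothetical wrapper: apply destination-based indices with VPERMD,
   i.e. compute out[dst_idx[i]] = a[i] for i in 0..7. */
__attribute__((target("avx2")))
void permute_by_dst_indices(const int32_t a[8], const int32_t dst_idx[8],
                            int32_t out[8])
{
    __m256i va  = _mm256_loadu_si256((const __m256i *)a);
    __m256i vd  = _mm256_loadu_si256((const __m256i *)dst_idx);
    __m256i vs  = invert_perm_epi32(vd);               /* source-based */
    __m256i res = _mm256_permutevar8x32_epi32(va, vs); /* VPERMD */
    _mm256_storeu_si256((__m256i *)out, res);
}
```

If the same index vector is reused for many permutes, the inversion cost is paid once and amortized; if every permute needs a fresh inversion, generating source-based indices up front is still the better fix.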