I am trying to vectorize a CBRNG (counter-based random number generator) that uses 64-bit widening multiplication:
static __inline__ uint64_t mulhilo64(uint64_t a, uint64_t b, uint64_t* hip) {
    __uint128_t product = ((__uint128_t)a)*((__uint128_t)b);
    *hip = product>>64;
    return (uint64_t)product;
}
Does such a multiplication exist in vectorized form in AVX2?
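No. There is no 64 x 64 -> 128 bit multiplication as a vector instruction. Nor is there a vector mulhi-type instruction for 64-bit elements (one that returns only the high half of the product); [V]PMULHW and [V]PMULHUW do this for 16-bit elements only.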
[V]PMULUDQ can do 32 x 32 -> 64 bit unsigned multiplication: it reads only the low 32-bit unsigned doubleword of each 64-bit lane in both sources (that is, every second 32-bit element) and writes the full 64-bit product to that lane as an unsigned quadword.
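Since [V]PMULUDQ supplies the 32 x 32 -> 64 bit building block, a 64 x 64 -> 128 bit product can be assembled per lane from four such partial products plus shifts and adds. The following is only a sketch of that schoolbook decomposition, not a tuned implementation; the names mulhilo64_from32 and mulhilo64_avx2 are made up for illustration, and the vector version is compiled only when AVX2 is enabled (e.g. with -mavx2).

```c
#include <stdint.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

/* Scalar model of one lane: the 128-bit product built from four
   32 x 32 -> 64 bit partial products (hypothetical helper name). */
static inline uint64_t mulhilo64_from32(uint64_t a, uint64_t b, uint64_t *hip) {
    const uint64_t mask = 0xFFFFFFFFu;
    uint64_t a_lo = a & mask, a_hi = a >> 32;
    uint64_t b_lo = b & mask, b_hi = b >> 32;
    uint64_t ll = a_lo * b_lo;   /* contributes at bit 0  */
    uint64_t lh = a_lo * b_hi;   /* contributes at bit 32 */
    uint64_t hl = a_hi * b_lo;   /* contributes at bit 32 */
    uint64_t hh = a_hi * b_hi;   /* contributes at bit 64 */
    /* Middle column: three values below 2^32, so the sum cannot overflow. */
    uint64_t mid = (ll >> 32) + (lh & mask) + (hl & mask);
    *hip = hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
    return (mid << 32) | (ll & mask);
}

#if defined(__AVX2__)
/* The same steps on four 64-bit lanes at once (hypothetical helper name).
   _mm256_mul_epu32 is the intrinsic for VPMULUDQ. */
static inline __m256i mulhilo64_avx2(__m256i a, __m256i b, __m256i *hip) {
    const __m256i mask = _mm256_set1_epi64x(0xFFFFFFFFLL);
    __m256i a_hi = _mm256_srli_epi64(a, 32);
    __m256i b_hi = _mm256_srli_epi64(b, 32);
    __m256i ll = _mm256_mul_epu32(a, b);
    __m256i lh = _mm256_mul_epu32(a, b_hi);
    __m256i hl = _mm256_mul_epu32(a_hi, b);
    __m256i hh = _mm256_mul_epu32(a_hi, b_hi);
    __m256i mid = _mm256_add_epi64(
        _mm256_srli_epi64(ll, 32),
        _mm256_add_epi64(_mm256_and_si256(lh, mask),
                         _mm256_and_si256(hl, mask)));
    *hip = _mm256_add_epi64(
        _mm256_add_epi64(hh, _mm256_srli_epi64(lh, 32)),
        _mm256_add_epi64(_mm256_srli_epi64(hl, 32),
                         _mm256_srli_epi64(mid, 32)));
    return _mm256_or_si256(_mm256_slli_epi64(mid, 32),
                           _mm256_and_si256(ll, mask));
}
#endif
```

This costs four multiplies and roughly a dozen shift, mask and add operations per vector of four lanes, so whether it beats four scalar multiplies depends on the surrounding code and should be measured.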
For a full 64 x 64 -> 128 bit product, the best you can probably hope for right now is the scalar MULX instruction (part of BMI2, introduced with Haswell), which has more flexible register use than MUL and does not affect the flags register - eliminating some stalls.
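As a rough sketch, MULX can be requested explicitly through the _mulx_u64 intrinsic from <immintrin.h>. The wrapper name mulhilo64_scalar is made up for illustration, and the intrinsic path is compiled only when BMI2 is enabled on x86-64 (e.g. with -mbmi2); otherwise it falls back to the __uint128_t form from the question.

```c
#include <stdint.h>
#if defined(__BMI2__) && defined(__x86_64__)
#include <immintrin.h>
#endif

/* Hypothetical wrapper: returns the low 64 bits of a*b, stores the high 64 bits in *hip. */
static inline uint64_t mulhilo64_scalar(uint64_t a, uint64_t b, uint64_t *hip) {
#if defined(__BMI2__) && defined(__x86_64__)
    /* _mulx_u64 takes an unsigned long long pointer, so use a local of that type
       rather than passing a uint64_t pointer directly. */
    unsigned long long hi;
    unsigned long long lo = _mulx_u64(a, b, &hi);
    *hip = (uint64_t)hi;
    return (uint64_t)lo;
#else
    /* Portable fallback: let the compiler pick the multiply instruction. */
    __uint128_t product = (__uint128_t)a * b;
    *hip = (uint64_t)(product >> 64);
    return (uint64_t)product;
#endif
}
```

In practice the explicit intrinsic may be unnecessary: a compiler targeting BMI2 is free to emit MULX for the __uint128_t version on its own, so it is worth checking the generated assembly before committing to either form.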