I'm using the AVX2 x86 256-bit SIMD extensions. I want to do a 32-bit integer component-wise if-then-else instruction. In the Intel documentation such an instruction is called vblend.
The Intel intrinsics guide contains the function _mm256_blendv_epi8, which does nearly what I need. The only problem is that it works on 8-bit integers. Unfortunately there is no _mm256_blendv_epi32 in the docs. My first question is: why does this function not exist? My second question is: how can I emulate it?
After some searching I found _mm256_blendv_ps, which does what I want for 32-bit floats. I also found the cast functions _mm256_castsi256_ps and _mm256_castps_si256, which reinterpret an integer vector as a vector of 32-bit floats and back without changing any bits. Putting these together gives:
inline __m256i _mm256_blendv_epi32(__m256i a, __m256i b, __m256i mask) {
    return _mm256_castps_si256(
        _mm256_blendv_ps(
            _mm256_castsi256_ps(a),
            _mm256_castsi256_ps(b),
            _mm256_castsi256_ps(mask)
        )
    );
}
While this looks like 5 function calls, 4 of them are only glorified casts and one maps directly onto a processor instruction. The whole function therefore boils down to a single processor instruction.
The really awkward part is that there seems to be a 32-bit blendv, except that the corresponding intrinsic is missing.
Is there some border case where this will fail miserably? For example, what happens when the integer bit pattern happens to represent a floating-point NaN? Does blendv simply ignore this or will it raise some signal?
In case this works: Am I correct that there is an 8-bit, a 32-bit and a 64-bit blendv, but a 16-bit blendv is missing?
If your mask is already all-zero / all-ones for the whole 32-bit element (like a vpcmpgtd result), use _mm256_blendv_epi8 directly.
> My code relies on blendv only checking the highest bit.
Then you have two good options:
- Broadcast the high bit within each element with an arithmetic right shift by 31 to set up for VPBLENDVB (_mm256_blendv_epi8), i.e. VPSRAD: mask = _mm256_srai_epi32(mask, 31). (A sketch of this option follows the VBLENDVPS code below.)
  VPSRAD is 1 uop on Intel Haswell, for port 0 only. (More throughput on Skylake: p01.) If your algorithm bottlenecks on port 0 (e.g. integer multiply and shift), this is not great.
- Use VBLENDVPS. You're correct that all the casts are just to keep the compiler happy, and that VBLENDVPS will do exactly what you want in one instruction.
static inline
__m256i blendvps_si256(__m256i a, __m256i b, __m256i mask) {
    __m256 res = _mm256_blendv_ps(_mm256_castsi256_ps(a), _mm256_castsi256_ps(b), _mm256_castsi256_ps(mask));
    return _mm256_castps_si256(res);
}
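For comparison, here is a minimal sketch of the first option (VPSRAD + VPBLENDVB); the wrapper name is my own, and it assumes <immintrin.h> is included. It keeps everything in the integer domain:

static inline
__m256i blendv_epi32_srai(__m256i a, __m256i b, __m256i mask) {
    // Broadcast the sign bit of each 32-bit element to all of its bytes
    // (VPSRAD by 31), so the per-byte VPBLENDVB picks whole elements.
    __m256i bytemask = _mm256_srai_epi32(mask, 31);
    return _mm256_blendv_epi8(a, b, bytemask);
}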
However, Intel SnB-family CPUs have a bypass-delay latency of 1 cycle when forwarding integer results to the FP blend unit, and another 1c latency when forwarding the blend results to other integer instructions. This might not hurt throughput if latency isn't the bottleneck.
For more about bypass-delay latency, see Agner Fog's microarch guide. It's the reason they don't make __m256i intrinsics for FP instructions, and vice versa. Note that since Sandybridge, FP shuffles don't have extra latency to forward from/to instructions like PADDD. So SHUFPS is a great way to combine data from two integer vectors if PUNPCK* or PALIGNR don't do exactly what you want. (SHUFPS on integers can be worth it even on Nehalem, where it does have a 2c penalty both ways, if throughput is your bottleneck.)
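As an illustration of that SHUFPS-on-integers point (a sketch with my own wrapper name and an arbitrary example immediate, not anything from the question):

static inline
__m256i shufps_si256_example(__m256i a, __m256i b) {
    // SHUFPS on reinterpreted integer vectors: per 128-bit lane, take the
    // low two 32-bit elements from a and the high two from b.
    __m256 res = _mm256_shuffle_ps(_mm256_castsi256_ps(a),
                                   _mm256_castsi256_ps(b),
                                   _MM_SHUFFLE(3, 2, 1, 0));
    return _mm256_castps_si256(res);
}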
Try both blend approaches and benchmark. Either could be better, depending on surrounding code.
Latency might not matter compared to uop throughput / instruction count. Also note that if you're just storing the result to memory, store instructions don't care which domain the data was coming from.
But if you are using this as part of a long dependency chain, then it might be worth the extra instruction to avoid the extra 2 cycles of latency for the data being blended.
Note that if the mask-generation is on the critical path, then VPSRAD's 1 cycle latency is equivalent to the bypass-delay latency, so using an FP blend is only 1 extra cycle of latency for the mask->result chain, vs. 2 extra cycles for the data->result chain.
> For example, what happens when the integer bit pattern happens to represent a floating-point NaN?
BLENDVPS doesn't care. Intel's insn ref manual fully documents everything an instruction can/can't do, and "SIMD Floating-Point Exceptions: None" means that this isn't a problem. See also the x86 tag wiki for links to docs.
FP blend/shuffle/bitwise-boolean/load/store instructions don't care about NaNs. Only instructions that do actual FP math (including CMPPS, MINPS, and stuff like that) raise FP exceptions or can possibly slow down with denormals.
> Am I correct that there is an 8-bit, a 32-bit and a 64-bit blendv, but a 16-bit blendv is missing?
Yes. But there are 32 and 16-bit arithmetic shifts, so it costs at most one extra instruction to use the 8-bit granularity blend. (There is no PSRAQ, so blendv of 64-bit integers is often best done with BLENDVPD, unless maybe the mask-generation is off the critical path and/or the same mask will be reused many times on the critical path.)
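For the 64-bit case just mentioned, a minimal sketch of the BLENDVPD route (the wrapper name is my own), analogous to the _ps version above:

static inline
__m256i blendvpd_si256(__m256i a, __m256i b, __m256i mask) {
    // Per 64-bit element: take from b where the top bit of the mask
    // element is set, otherwise from a. The casts only reinterpret bits.
    __m256d res = _mm256_blendv_pd(_mm256_castsi256_pd(a),
                                   _mm256_castsi256_pd(b),
                                   _mm256_castsi256_pd(mask));
    return _mm256_castpd_si256(res);
}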
The most common use-case is for compare-masks where each element is all-ones or all-zeros already, so you could blend with PAND/PANDN => POR. Of course, clever tricks that leave just the sign bit of your mask with the truth value can save instructions and latency, especially since variable-blends are somewhat faster than three boolean bitwise instructions. (e.g. ORPS two float vectors to see if they're both non-negative, instead of 2x CMPPS and ORing the masks. This can work great if you don't care about negative zero, or you're happy to treat underflow to -0.0 as negative.)
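For reference, a sketch of that pure boolean blend for full-width masks (the function name is my own):

static inline
__m256i select_si256(__m256i a, __m256i b, __m256i mask) {
    // mask must be all-ones or all-zeros per element (e.g. a VPCMPGTD result):
    // (mask & b) | (~mask & a), i.e. VPAND + VPANDN + VPOR.
    return _mm256_or_si256(_mm256_and_si256(mask, b),
                           _mm256_andnot_si256(mask, a));
}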