可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

In SSE there is a function _mm_cvtepi32_ps(__m128i input) which takes input vector of 32 bits wide signed integers (int32_t) and converts them into floats.

Now, I want to interpret input integers as not signed. But there is no function _mm_cvtepu32_ps and I could not find an implementation of one. Do you know where I can find such a function or at least give a hint on the implementation? To illustrate the the difference in results:

unsigned int a = 2480160505; // 10010011 11010100 00111110 11111001   
float a1 = a; // 01001111 00010011 11010100 00111111;  
float a2 = (signed int)a; // 11001110 11011000 01010111 10000010

回答1:

This functionality exists in AVX-512, but if you can't wait until then the only thing I can suggest is to convert the unsigned int input values into pairs of smaller values, convert these, and then add them together again, e.g.

inline __m128 _mm_cvtepu32_ps(const __m128i v)
{
    __m128i v2 = _mm_srli_epi32(v, 1);     // v2 = v / 2
    __m128i v1 = _mm_sub_epi32(v, v2);     // v1 = v - (v / 2)
    __m128 v2f = _mm_cvtepi32_ps(v2);
    __m128 v1f = _mm_cvtepi32_ps(v1);
    return _mm_add_ps(v2f, v1f); 
}

UPDATE

As noted by @wim in his answer, the above solution fails for an input value of UINT_MAX. Here is a more robust, but slightly less efficient solution, which should work for the full uint32_t input range:

inline __m128 _mm_cvtepu32_ps(const __m128i v)
{
    __m128i v2 = _mm_srli_epi32(v, 1);                 // v2 = v / 2
    __m128i v1 = _mm_and_si128(v, _mm_set1_epi32(1));  // v1 = v & 1
    __m128 v2f = _mm_cvtepi32_ps(v2);
    __m128 v1f = _mm_cvtepi32_ps(v1);
    return _mm_add_ps(_mm_add_ps(v2f, v2f), v1f);      // return 2 * v2 + v1
}

回答2:

I think Paul's answer is nice, but it fails for v=4294967295U (=2^32-1). In that case v2=2^31-1 and v1=2^31. Intrinsic _mm_cvtepi32_ps converts 2^31 to -2.14748365E9 . v2=2^31-1 is converted to 2.14748365E9 and consequently _mm_add_ps returns 0 (due to rounding v1f and v2f are the exact opposite of each other).

The idea of the solution below is to copy the most significant bit of v to v_high. The other bits of v are copied to v_low. v_high is converted to 0 or 2.14748365E9 .

inline __m128 _mm_cvtepu32_v3_ps(const __m128i v)
{
__m128i msk0=_mm_set1_epi32(0x7FFFFFFF);
__m128i zero=_mm_xor_si128(msk0,msk0);
__m128i cnst2_31=_mm_set1_epi32(0x4F000000); /* IEEE representation of float 2^31 */

__m128i v_high=_mm_andnot_si128(msk0,v);
__m128i v_low=_mm_and_si128(msk0,v);
__m128  v_lowf=_mm_cvtepi32_ps(v_low);
__m128i msk1=_mm_cmpeq_epi32(v_high,zero);
__m128  v_highf=_mm_castsi128_ps(_mm_andnot_si128(msk1,cnst2_31));  
__m128  v_sum=_mm_add_ps(v_lowf,v_highf);
return v_sum;

}

Update

It was possible to reduce the number of instructions:

inline __m128 _mm_cvtepu32_v4_ps(const __m128i v)
{
__m128i msk0=_mm_set1_epi32(0x7FFFFFFF);
__m128i cnst2_31=_mm_set1_epi32(0x4F000000);

__m128i msk1=_mm_srai_epi32(v,31);
__m128i v_low=_mm_and_si128(msk0,v);
__m128  v_lowf=_mm_cvtepi32_ps(v_low);
__m128  v_highf=_mm_castsi128_ps(_mm_and_si128(msk1,cnst2_31));  
__m128  v_sum=_mm_add_ps(v_lowf,v_highf);
return v_sum;
}

Intrinsic _mm_srai_epi32 shifts the most significant bit of v to the right, while shifting in sign bits, which turns out to be quite useful here.

回答3:

With Paul R's solution and with my previous solution the difference between the rounded floating point and the original integer is less than or equal to 0.75 ULP (Unit in the Last Place). In these methods at two places rounding may occur: in _mm_cvtepi32_ps and in _mm_add_ps. This leads to results that are not as accurate as possible for some inputs.

For example, with Paul R's method 0x2000003=33554435 is converted to 33554432.0, but 33554436.0 also exists as a float, which would have been better here. My previous solution suffers from similar inaccuracies. Such inaccurate results may also occur with compiler generated code, see here.

Following the approach of gcc (see Peter Cordes' answer to that other SO question), an accurate conversion within 0.5 ULP is obtained:

inline __m128 _mm_cvtepu32_ps(const __m128i v)
{
    __m128i msk_lo    = _mm_set1_epi32(0xFFFF);
    __m128  cnst65536f= _mm_set1_ps(65536.0f);

    __m128i v_lo      = _mm_and_si128(v,msk_lo);          /* extract the 16 lowest significant bits of v                                   */
    __m128i v_hi      = _mm_srli_epi32(v,16);             /* 16 most significant bits of v                                                 */
    __m128  v_lo_flt  = _mm_cvtepi32_ps(v_lo);            /* No rounding                                                                   */
    __m128  v_hi_flt  = _mm_cvtepi32_ps(v_hi);            /* No rounding                                                                   */
            v_hi_flt  = _mm_mul_ps(cnst65536f,v_hi_flt);  /* No rounding                                                                   */
    return              _mm_add_ps(v_hi_flt,v_lo_flt);    /* Rounding may occur here, mul and add may fuse to fma for haswell and newer    */
}                                                         /* _mm_add_ps is guaranteed to give results with an error of at most 0.5 ULP     */

Note that other high bits/low bits partitions are possible as long as _mm_cvt_ps can convert both pieces to floats without rounding. For example, a partition with 20 high bits and 12 low bits will work equally well.