可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

I have the following code snippet:

#include <cstdio>
#include <cstdint>

static const size_t ARR_SIZE = 129;

int main()
{
  uint32_t value = 2570980487;

  uint32_t arr[ARR_SIZE];
  for (int x = 0; x < ARR_SIZE; ++x)
    arr[x] = value;

  float arr_dst[ARR_SIZE];
  for (int x = 0; x < ARR_SIZE; ++x)
  {
    arr_dst[x] = static_cast<float>(arr[x]);
  }

  printf("%s\n", arr_dst[ARR_SIZE - 1] == arr_dst[ARR_SIZE - 2] ? "OK" : "WTF??!!");

  printf("magic = %0.10f\n", arr_dst[ARR_SIZE - 2]);
  printf("magic = %0.10f\n", arr_dst[ARR_SIZE - 1]);
  return 0;
}

If I compile it under MS Visual Studio 2015 I can see that the output is:

WTF??!!
magic = 2570980352.0000000000
magic = 2570980608.0000000000

So the last arr_dst element is different from the previous one, yet these two values were obtained by converting the same value, which populates the arr array! Is it a bug?

I noticed that if I modify the conversion loop in the following manner, I get the "OK" result:

for (int x = 0; x < ARR_SIZE; ++x)
{
  if (x == 0)
    x = 0;
  arr_dst[x] = static_cast<float>(arr[x]);
}

So this probably is some issue with vectorizing optimisation.

This behavior does not reproduce on gcc 4.8. Any ideas?

回答1:

I did an investigation on a PowerPC imeplementation (Freescale MCP7450) as they IMHO are far better documented than any voodoo Intel comes up with.

As it turns out the floating point unit, FPU, and vector unit may have different rounding for floating point operations. The FPU can be configured to use one of four rounding modes; round to nearest (default), truncate, towards positive infinity and towards negative infinity. The vector unit however is only able to round to nearest, with a few select instructions having specific rounding rules. The internal precision of the FPU is 106-bit. The vector unit fulfills IEEE-754 but the documentation does not state much more.

Looking at your result the conversion 2570980608 is closer to the original integer, suggesting the FPU has better internal precision than the vector unit OR different rounding modes.

回答2:

A 32-bit IEEE-754 binary float, such as MSVC++ uses, provides only 6-7 decimal digits of precision. Your starting value is well within the range of that type, but it seems not to be exactly representable by that type, as indeed is the case for most values of type uint32_t.

At the same time, the floating-point unit of an x86 or x86_64 processor uses a wider representation even than MSVC++'s 64-bit double. It seems likely that after the loop exits, the last-computed array element remains in an FPU register, in its extended precision form. The program may then use that value directly from the register instead of reading it back from memory, which it is obligated to do with previous elements.

If the program performs the == comparison by promoting the narrower representation to the wider instead of the other way around, then the two values might indeed compare unequal, as the round-trip from extended precision to float and back loses precision. In any event, both values are converted to type double when passed to printf(); if indeed they compared unequal, then it is likely that the results of those conversions differ, too.

I'm not up on MSVC++ compile options, but very likely there is one that would quash this behavior. Such options sometimes go by names such as "strict math" or "strict fp". Be aware, however, that turning on such an option (or turning off its opposite) can be very costly in an FP-heavy program.

回答3:

Converting between unsigned and float is not simple on x86; there's no single instruction for it (until AVX512). A common technique is to convert as signed and then fixup the result. There are multiple ways of doing this. (See this Q&A for some manually-vectorized methods with C intrinsics, not all of which have perfectly-rounded results.)

MSVC vectorizes the first 128 with one strategy, and then uses a different strategy (which wouldn't vectorize) for the last scalar element, which involves converting to double and then from double to float.

gcc and clang produce the 2570980608.0 result from their vectorized and scalar methods. 2570980608 - 2570980487 = 121, and 2570980487 - 2570980352 = 135 (with no rounding of inputs/outputs), so gcc and clang produce the correctly rounded result in this case (less than 0.5ulp of error). IDK if that's true for every possible uint32_t (but there are only 2^32 of them, we could exhaustively check). MSVC's end result for the vectorized loop has slightly more than 0.5ulp of error, but the scalar method is correctly rounded for this input.

IEEE math demands that + - * / and sqrt produce correctly rounded results (less than 0.5ulp of error), but other functions (like log) don't have such a strict requirement. IDK what the requirements are on rounding for int->float conversions, so IDK if what MSVC does is strictly legal (if you didn't use /fp:fast or anything).

See also Bruce Dawson's Floating-Point Determinism blog post (part of his excellent series about FP math), although he doesn't mention integer<->FP conversions.

We can see in the asm linked by the OP what MSVC did (stripped down to only the interesting instructions and commented by hand):

; Function compile flags: /Ogtp
# assembler macro constants
_arr_dst$ = -1040                   ; size = 516
_arr$ = -520                        ; size = 516
_main   PROC                        ; COMDAT

  00013      mov     edx, 129
  00018      mov     eax, -1723986809   ; this is your unsigned 2570980487
  0001d      mov     ecx, edx
  00023      lea     edi, DWORD PTR _arr$[esp+1088]  ; edi=arr
  0002a      rep stosd             ; memset in chunks of 4B
  # arr[0..128] = 2570980487 at this point

  0002c      xor     ecx, ecx      ; i = 0
  # xmm2 = 0.0 in each element (i.e. all-zero)
  # xmm3 = __xmm@4f8000004f8000004f8000004f800000  (a constant repeated in each of 4 float elements)


  ####### The vectorized unsigned->float conversion strategy:
  $LL7@main:                                       ; do{
  00030      movups  xmm0, XMMWORD PTR _arr$[esp+ecx*4+1088]  ; load 4 uint32_t
  00038      cvtdq2ps xmm1, xmm0                 ; SIGNED int to Single-precision float
  0003b      movaps  xmm0, xmm1
  0003e      cmpltps xmm0, xmm2                  ; xmm0 = (xmm0 < 0.0)
  00042      andps   xmm0, xmm3                  ; mask the magic constant
  00045      addps   xmm0, xmm1                  ; x += (x<0.0) ? magic_constant : 0.0f;
   # There's no instruction for converting from unsigned to float, so compilers use inconvenient techniques like this to correct the result of converting as signed.
  00048      movups  XMMWORD PTR _arr_dst$[esp+ecx*4+1088], xmm0 ; store 4 floats to arr_dst
  ; and repeat the same thing again, with addresses that are 16B higher (+1104)
  ; i.e. this loop is unrolled by two

  0006a      add     ecx, 8         ;  i+=8 (two vectors of 4 elements)
  0006d      cmp     ecx, 128
  00073      jb  SHORT $LL7@main    ; }while(i<128)

 #### End of vectorized loop
 # and then IDK what MSVC smoking; both these values are known at compile time.  Is /Ogtp not full optimization?
 # I don't see a branch target that would let execution reach this code
 #  other than by falling out of the loop that ends with ecx=128
  00075      cmp     ecx, edx
  00077      jae     $LN21@main     ; if(i>=129): always false

  0007d      sub     edx, ecx       ; edx = 129-128 = 1

... some more ridiculous known-at-compile-time jumping later ...

 ######## The scalar unsigned->float conversion strategy for the last element
$LC15@main:
  00140      mov     eax, DWORD PTR _arr$[esp+ecx*4+1088]
  00147      movd    xmm0, eax
  # eax = xmm0[0] = arr[128]
  0014b      cvtdq2pd xmm0, xmm0        ; convert the last element TO DOUBLE
  0014f      shr     eax, 31            ; shift the sign bit to bit 1, so eax = 0 or 1
     ; then eax indexes a 16B constant, selecting either 0 or 0x41f0... (as whatever double that represents)
  00152      addsd   xmm0, QWORD PTR __xmm@41f00000000000000000000000000000[eax*8]
  0015b      cvtpd2ps xmm0, xmm0        ; double -> float
  0015f      movss   DWORD PTR _arr_dst$[esp+ecx*4+1088], xmm0  ; and store it

  00165      inc     ecx            ;   ++i;
  00166      cmp     ecx, 129       ; } while(i<129)
  0016c      jb  SHORT $LC15@main
  # Yes, this is a loop, which always runs exactly once for the last element

By way of comparison, clang and gcc also don't optimize the whole thing away at compile time, but they do realize that they don't need a cleanup loop, and just do a single scalar store or convert after the respective loops. (clang actually fully unrolls everything unless you tell it not to.)

See the code on the Godbolt compiler explorer.

gcc just converts the upper and lower 16b halves to float separately, and combines them with a multiply by 65536 and add.

Clang's unsigned -> float conversion strategy is interesting: it never uses a cvt instruction at all. I think it stuffs the two 16-bit halves of the unsigned integer into the mantissa of two floats directly (with some tricks to set the exponents (bitwise boolean stuff and an ADDPS), then adds the low and high half together like gcc does.

Of course, if you compile to 64-bit code, the scalar conversion can just zero-extend the uint32_t to 64-bit and convert that as a signed int64_t to float. Signed int64_t can represent every value of uint32_t, and x86 can convert a 64-bit signed int to float efficiently. But that doesn't vectorize.