I have the following code snippet:
#include <cstdio>
#include <cstdint>
static const size_t ARR_SIZE = 129;
int main()
{
uint32_t value = 2570980487;
uint32_t arr[ARR_SIZE];
for (int x = 0; x < ARR_SIZE; ++x)
arr[x] = value;
float arr_dst[ARR_SIZE];
for (int x = 0; x < ARR_SIZE; ++x)
{
arr_dst[x] = static_cast<float>(arr[x]);
}
printf("%s\n", arr_dst[ARR_SIZE - 1] == arr_dst[ARR_SIZE - 2] ? "OK" : "WTF??!!");
printf("magic = %0.10f\n", arr_dst[ARR_SIZE - 2]);
printf("magic = %0.10f\n", arr_dst[ARR_SIZE - 1]);
return 0;
}
If I compile it under MS Visual Studio 2015 I can see that the output is:
WTF??!!
magic = 2570980352.0000000000
magic = 2570980608.0000000000
So the last arr_dst
element is different from the previous one, yet these two values were obtained by converting the same value, which populates the arr array!
Is it a bug?
I noticed that if I modify the conversion loop in the following manner, I get the "OK" result:
for (int x = 0; x < ARR_SIZE; ++x)
{
if (x == 0)
x = 0;
arr_dst[x] = static_cast<float>(arr[x]);
}
So this probably is some issue with vectorizing optimisation.
This behavior does not reproduce on gcc 4.8. Any ideas?
I did an investigation on a PowerPC imeplementation (Freescale MCP7450) as they IMHO are far better documented than any voodoo Intel comes up with.
As it turns out the floating point unit, FPU, and vector unit may have different rounding for floating point operations. The FPU can be configured to use one of four rounding modes; round to nearest (default), truncate, towards positive infinity and towards negative infinity. The vector unit however is only able to round to nearest, with a few select instructions having specific rounding rules. The internal precision of the FPU is 106-bit. The vector unit fulfills IEEE-754 but the documentation does not state much more.
Looking at your result the conversion 2570980608 is closer to the original integer, suggesting the FPU has better internal precision than the vector unit OR different rounding modes.
A 32-bit IEEE-754 binary float, such as MSVC++ uses, provides only 6-7 decimal digits of precision. Your starting value is well within the range of that type, but it seems not to be exactly representable by that type, as indeed is the case for most values of type uint32_t
.
At the same time, the floating-point unit of an x86 or x86_64 processor uses a wider representation even than MSVC++'s 64-bit double
. It seems likely that after the loop exits, the last-computed array element remains in an FPU register, in its extended precision form. The program may then use that value directly from the register instead of reading it back from memory, which it is obligated to do with previous elements.
If the program performs the ==
comparison by promoting the narrower representation to the wider instead of the other way around, then the two values might indeed compare unequal, as the round-trip from extended precision to float
and back loses precision. In any event, both values are converted to type double
when passed to printf()
; if indeed they compared unequal, then it is likely that the results of those conversions differ, too.
I'm not up on MSVC++ compile options, but very likely there is one that would quash this behavior. Such options sometimes go by names such as "strict math" or "strict fp". Be aware, however, that turning on such an option (or turning off its opposite) can be very costly in an FP-heavy program.
Converting between unsigned
and float
is not simple on x86; there's no single instruction for it (until AVX512). A common technique is to convert as signed and then fixup the result. There are multiple ways of doing this. (See this Q&A for some manually-vectorized methods with C intrinsics, not all of which have perfectly-rounded results.)
MSVC vectorizes the first 128 with one strategy, and then uses a different strategy (which wouldn't vectorize) for the last scalar element, which involves converting to double
and then from double
to float
.
gcc and clang produce the 2570980608.0
result from their vectorized and scalar methods. 2570980608 - 2570980487 = 121
, and 2570980487 - 2570980352 = 135
(with no rounding of inputs/outputs), so gcc and clang produce the correctly rounded result in this case (less than 0.5ulp of error). IDK if that's true for every possible uint32_t (but there are only 2^32 of them, we could exhaustively check). MSVC's end result for the vectorized loop has slightly more than 0.5ulp of error, but the scalar method is correctly rounded for this input.
IEEE math demands that +
-
*
/
and sqrt
produce correctly rounded results (less than 0.5ulp of error), but other functions (like log
) don't have such a strict requirement. IDK what the requirements are on rounding for int->float conversions, so IDK if what MSVC does is strictly legal (if you didn't use /fp:fast
or anything).
See also Bruce Dawson's Floating-Point Determinism blog post (part of his excellent series about FP math), although he doesn't mention integer<->FP conversions.
We can see in the asm linked by the OP what MSVC did (stripped down to only the interesting instructions and commented by hand):
; Function compile flags: /Ogtp
# assembler macro constants
_arr_dst$ = -1040 ; size = 516
_arr$ = -520 ; size = 516
_main PROC ; COMDAT
00013 mov edx, 129
00018 mov eax, -1723986809 ; this is your unsigned 2570980487
0001d mov ecx, edx
00023 lea edi, DWORD PTR _arr$[esp+1088] ; edi=arr
0002a rep stosd ; memset in chunks of 4B
# arr[0..128] = 2570980487 at this point
0002c xor ecx, ecx ; i = 0
# xmm2 = 0.0 in each element (i.e. all-zero)
# xmm3 = __xmm@4f8000004f8000004f8000004f800000 (a constant repeated in each of 4 float elements)
####### The vectorized unsigned->float conversion strategy:
$LL7@main: ; do{
00030 movups xmm0, XMMWORD PTR _arr$[esp+ecx*4+1088] ; load 4 uint32_t
00038 cvtdq2ps xmm1, xmm0 ; SIGNED int to Single-precision float
0003b movaps xmm0, xmm1
0003e cmpltps xmm0, xmm2 ; xmm0 = (xmm0 < 0.0)
00042 andps xmm0, xmm3 ; mask the magic constant
00045 addps xmm0, xmm1 ; x += (x<0.0) ? magic_constant : 0.0f;
# There's no instruction for converting from unsigned to float, so compilers use inconvenient techniques like this to correct the result of converting as signed.
00048 movups XMMWORD PTR _arr_dst$[esp+ecx*4+1088], xmm0 ; store 4 floats to arr_dst
; and repeat the same thing again, with addresses that are 16B higher (+1104)
; i.e. this loop is unrolled by two
0006a add ecx, 8 ; i+=8 (two vectors of 4 elements)
0006d cmp ecx, 128
00073 jb SHORT $LL7@main ; }while(i<128)
#### End of vectorized loop
# and then IDK what MSVC smoking; both these values are known at compile time. Is /Ogtp not full optimization?
# I don't see a branch target that would let execution reach this code
# other than by falling out of the loop that ends with ecx=128
00075 cmp ecx, edx
00077 jae $LN21@main ; if(i>=129): always false
0007d sub edx, ecx ; edx = 129-128 = 1
... some more ridiculous known-at-compile-time jumping later ...
######## The scalar unsigned->float conversion strategy for the last element
$LC15@main:
00140 mov eax, DWORD PTR _arr$[esp+ecx*4+1088]
00147 movd xmm0, eax
# eax = xmm0[0] = arr[128]
0014b cvtdq2pd xmm0, xmm0 ; convert the last element TO DOUBLE
0014f shr eax, 31 ; shift the sign bit to bit 1, so eax = 0 or 1
; then eax indexes a 16B constant, selecting either 0 or 0x41f0... (as whatever double that represents)
00152 addsd xmm0, QWORD PTR __xmm@41f00000000000000000000000000000[eax*8]
0015b cvtpd2ps xmm0, xmm0 ; double -> float
0015f movss DWORD PTR _arr_dst$[esp+ecx*4+1088], xmm0 ; and store it
00165 inc ecx ; ++i;
00166 cmp ecx, 129 ; } while(i<129)
0016c jb SHORT $LC15@main
# Yes, this is a loop, which always runs exactly once for the last element
By way of comparison, clang and gcc also don't optimize the whole thing away at compile time, but they do realize that they don't need a cleanup loop, and just do a single scalar store or convert after the respective loops. (clang actually fully unrolls everything unless you tell it not to.)
See the code on the Godbolt compiler explorer.
gcc just converts the upper and lower 16b halves to float separately, and combines them with a multiply by 65536 and add.
Clang's unsigned
-> float
conversion strategy is interesting: it never uses a cvt
instruction at all. I think it stuffs the two 16-bit halves of the unsigned integer into the mantissa of two floats directly (with some tricks to set the exponents (bitwise boolean stuff and an ADDPS), then adds the low and high half together like gcc does.
Of course, if you compile to 64-bit code, the scalar conversion can just zero-extend the uint32_t
to 64-bit and convert that as a signed int64_t to float. Signed int64_t can represent every value of uint32_t, and x86 can convert a 64-bit signed int to float efficiently. But that doesn't vectorize.