Converting from unsigned long long to float with r

2019-05-05 21:19发布

问题:

I need to write a function that rounds from unsigned long long to float, and the rounding should be toward nearest even. I cannot just do a C++ type-cast, since AFAIK the standard does not specify the rounding. I was thinking of using boost::numeric, but i could not find any useful lead after reading the documentation. Can this be done using that library? Of course, if there is an alternative, i would be glad to use it.

Any help would be much appreciated.

EDIT: Adding an example to make things a bit clearer. Suppose i want to convert 0xffffff7fffffffff to its floating point representation. The C++ standard permits either one of:

  1. 0x5f7fffff ~ 1.9999999*2^63
  2. 0x5f800000 = 2^64

Now if you add the restriction of round to nearest even, only the first result is acceptable.

回答1:

Since you have so many bits in the source that can't be represented in the float and you can't (apparently) rely on the language's conversion, you'll have to do it yourself.

I devised a scheme that may or may not help you. Basically, there are 31 bits to represent positive numbers in a float so I pick up the 31 most significant bits in the source number. Then I save off and mask away all the lower bits. Then based on the value of the lower bits I round the "new" LSB up or down and finally use static_cast to create a float.

I left in some couts that you can remove as desired.

const unsigned long long mask_bit_count = 31;

float ull_to_float2(unsigned long long val)
{
    // How many bits are needed?
    int b = sizeof(unsigned long long) * CHAR_BIT - 1;
    for(; b >= 0; --b)
    {
        if(val & (1ull << b))
        {
            break;
        }
    }

    std::cout << "Need " << (b + 1) << " bits." << std::endl;

    // If there are few enough significant bits, use normal cast and done.
    if(b < mask_bit_count)
    {
        return static_cast<float>(val & ~1ull);
    }

    // Save off the low-order useless bits:
    unsigned long long low_bits = val & ((1ull << (b - mask_bit_count)) - 1);
    std::cout << "Saved low bits=" << low_bits << std::endl;

    std::cout << val << "->mask->";
    // Now mask away those useless low bits:
    val &= ~((1ull << (b - mask_bit_count)) - 1);
    std::cout << val << std::endl;

    // Finally, decide how to round the new LSB:
    if(low_bits > ((1ull << (b - mask_bit_count)) / 2ull))
    {
        std::cout << "Rounding up " << val;
        // Round up.
        val |= (1ull << (b - mask_bit_count));
        std::cout << " to " << val << std::endl;
    }
    else
    {
        // Round down.
        val &= ~(1ull << (b - mask_bit_count));
    }

    return static_cast<float>(val);
}


回答2:

I did this in Smalltalk for arbitrary precision integer (LargeInteger), implemented and tested in Squeak/Pharo/Visualworks/Gnu Smalltalk/Dolphin Smalltalk, and even blogged about it if you can read Smalltalk code http://smallissimo.blogspot.fr/2011/09/clarifying-and-optimizing.html .
The trick for accelerating the algorithm is this one: IEEE 754 compliant FPU will round exactly the result of an inexact operation. So we can afford 1 inexact operation and let the hardware rounds correctly for us. That let us handle easily first 48 bits. But we cannot afford two inexact operations, so we sometimes have to care of the lowest bits differently...
Hope the code is documented enough:

#include <math.h>
#include <float.h>
float ull_to_float3(unsigned long long val)
{
    int prec=FLT_MANT_DIG ;             // 24 bits, the float precision
    unsigned long long high=val>>prec;  // the high bits above float precision
    unsigned long long mask=(1ull<<prec) - 1 ;      // 0xFFFFFFull a mask for extracting significant bits
    unsigned long long tmsk=(1ull<<(prec - 1)) - 1; // 0x7FFFFFull same but tie bit
    // handle trivial cases, 48 bits or less,
    // let FPU apply correct rounding after exactly 1 inexact operation
    if( high <= mask )
        return ldexpf((float) high,prec) + (float) (val & mask);
    // more than 48 bits,
    // what scaling s is needed to isolate highest 48 bits of val?
    int s = 0;
    for( ; high > mask ; high >>= 1) ++s;
    // high now contains highest 24 bits
    float f_high = ldexpf( (float) high , prec + s );
    // store next 24 bits in mid
    unsigned long long mid = (val >> s) & mask;
    // care of rare case when trailing low bits can change the rounding:
    // can mid bits be a case of perfect tie or perfect zero?
    if( (mid & tmsk) == 0ull )
    {
        // if low bits are zero, mid is either an exact tie or an exact zero
        // else just increment mid to distinguish from such case
        unsigned long long low = val & ((1ull << s) - 1);
        if(low > 0ull) mid++;
    }
    return f_high + ldexpf( (float) mid , s );
}

Bonus: this code should round according to your FPU rounding mode whatever it may be, since we implicitely used the FPU to perform rounding with + operation.
However, beware of aggressive optimizations in standards < C99, who knows when the compiler will use extended precision... (unless you force something like -ffloat-store).
If you always want to round to nearest even, whatever the current rounding mode, then you'll have to increment high bits when:

  • mid bits > tie, where tie=1ull<<(prec-1);
  • mid bits == tie and (low bits > 0 or high bits is odd).

EDIT:
If you stick to round-to-nearest-even tie breaking, then another solution is to use Shewchuck EXPANSION-SUM of non adjacent parts (fhigh,flow) and (fmid) see http://www-2.cs.cmu.edu/afs/cs/project/quake/public/papers/robust-arithmetic.ps :

#include <math.h>
#include <float.h>
float ull_to_float4(unsigned long long val)
{
    int prec=FLT_MANT_DIG ;             // 24 bits, the float precision
    unsigned long long mask=(1ull<<prec) - 1 ; // 0xFFFFFFull a mask for extracting significant bits
    unsigned long long high=val>>(2*prec);     // the high bits
    unsigned long long mid=(val>>prec) & mask; // the mid bits
    unsigned long long low=val & mask;         // the low bits
    float fhigh = ldexpf((float) high,2*prec);
    float fmid  = ldexpf((float) mid,prec);
    float flow  = (float) low;
    float sum1 = fmid + flow;
    float residue1 = flow - (sum1 - fmid);
    float sum2 = fhigh + sum1;
    float residue2 = sum1 - (sum2 - fhigh);
    return (residue1 + residue2) + sum2;
}

This makes a branch-free algorithm with a bit more ops. It may work with other rounding modes, but I let you analyze the paper to make sure...



回答3:

What is possible between between 8-byte integers and the float format is straightforward to explain but less so to implement!

The next paragraph concerns what is representable in 8 byte signed integers.

All positive integers between 1 (2^0) and 16777215 (2^24-1) are exactly representable in iEEE754 single precision (float). Or, to be precise, all numbers between 2^0 and 2^24-2^0 in increments of 2^0. The next range of exactly representable positive integers is 2^1 to 2^25-2^1 in increments of 2^1 and so on up to 2^39 to 2^63-2^39 in increments of 2^39.

Unsigned 8-byte integer values can be expressed up to 2^64-2^40 in increments of 2^40.

The single precison format doesn't stop here but goes on all the way up to the range 2^103 to 2^127-2^103 in increments of 2^103.

For 4-byte integers (long) the highest float range is 2^7 to 2^31-2^7 in 2^7 increments.

On the x86 architecture the largest integer type supported by the floating point instruction set is the 8 byte signed integer. 2^64-1 cannot be loaded by conventional means.

This means that for a given range increment expressed as "2^i where i is an integer >0" all integers that end with the bit pattern 0x1 up to 2^i-1 will not be exactly representable within that range in a float This means that what you call rounding upwards is actually dependent on what range you are working in. It is of no use to try to round up by 1 (2^0) or 16 (2^4) if the granularity of the range you are in is 2^19.

An additional consequence of what you propose to do (rounding 2^63-1 to 2^63) could result in an (long integer format) overflow if you attempt the following conversion: longlong_int=(long long) ((float) 2^63).

Check out this small program I wrote (in C) which should help illustrate what is possible and what isn't.

int main (void)
{
  __int64 basel=1,baseh=16777215,src,dst,j;
  float cnvl,cnvh,range;
  int i=0;

  while (i<40)
  {
    src=basel<<i;
    cnvl=(float) src;
    dst=(__int64) cnvl;    /* compare dst with basel */

    src=baseh<<i;
    cnvh=(float) src;
    dst=(__int64) cnvh;    /* compare dst with baseh */

    j=basel;
    while (j<=baseh)
    {
      range=(float) j;
      dst=(__int64) range;

      if (j!=dst) dst/=0;

      j+=basel;
    }

    ++i;
  }
  return i;
}

This program shows the representable integer value ranges. There is overlap beteen them: for example 2^5 is representable in all ranges with a lower boundary 2^b where 1=