First of all, this is not a floating point newbie question. I know results of floating point arithmetic (not to mention transcendental functions) usually cannot be represented exactly, and that most terminating decimals cannot be represented exactly as binary floating point numbers.
That said, each possible floating point value corresponds exactly to a dyadic rational (a rational number p/q where q is a power of 2), which in turn has an exact decimal representation.
My question is: how do you find this exact decimal representation efficiently? sprintf and similar functions are usually specified only up to a number of significant digits sufficient to uniquely determine the original floating point value; they don't necessarily print the exact decimal representation. I know one algorithm I've used, but it's very slow: O(e^2), where e is the exponent. Here's an outline:
- Convert the mantissa to a decimal integer. You can either do this by pulling apart the bits to read the mantissa directly, or you can write a messy floating point loop that first multiplies the value by a power of two to put it in the range 1<=x<10, then pulls off a digit at a time by casting to int, subtracting, and multiplying by 10.
- Apply the exponent by repeatedly multiplying or dividing by 2. This is an operation on the string of decimal digits you generated. Every ~3 multiplications adds an extra digit on the left; every single division adds an extra digit on the right. (A sketch of this step follows the list.)
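For reference, here is a compact sketch of step 2, my own illustration rather than the exact code I use: the mantissa is assumed to be already converted to a decimal digit string, here the mantissa of the double nearest 0.3, which is 5404319552844595 × 2^-54.

```c
#include <stdio.h>
#include <string.h>

#define CAP 2200            /* worst case adds ~|exponent| digits per side */

static char num[CAP];       /* decimal digits, most significant first */
static int  point;          /* how many digits lie before the decimal point */

/* Multiply the digit string by 2 (one pass, right to left). */
static void mul2(void) {
    int carry = 0, n = strlen(num);
    for (int i = n - 1; i >= 0; --i) {
        int cur = (num[i] - '0') * 2 + carry;
        num[i] = '0' + cur % 10;
        carry = cur / 10;
    }
    if (carry) { memmove(num + 1, num, n + 1); num[0] = '1'; point++; }
}

/* Divide the digit string by 2 (one pass, left to right). */
static void div2(void) {
    int rem = 0, n = strlen(num);
    for (int i = 0; i < n; ++i) {
        int cur = rem * 10 + (num[i] - '0');
        num[i] = '0' + cur / 2;
        rem = cur % 2;
    }
    if (rem) { num[n] = '5'; num[n + 1] = '\0'; }  /* one new digit on the right */
}

int main(void) {
    /* The double nearest 0.3: mantissa 5404319552844595, exponent -54.
       mul2() would handle positive exponents the same way. */
    strcpy(num, "5404319552844595");
    point = strlen(num);
    for (int i = 0; i < 54; ++i) div2();

    int lead = 0;
    while (lead < point - 1 && num[lead] == '0') lead++;  /* trim leading 0s */
    printf("%.*s.%s\n", point - lead, num + lead, num + point);
    return 0;
}
```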
Is this really the best possible? I doubt it, but I'm not a floating-point expert and I can't find a way to do the base-10 computations on the floating point representation of the number without running into a possibility of inexact results (multiplying or dividing by anything but a power of 2 is a lossy operation on floating point numbers unless you know you have free bits to work with).
There's been a lot of work on printing floating-point numbers. The gold standard is to print out a decimal equivalent of minimal length such that when the decimal equivalent is read back in, you get the same floating-point number that you started with, no matter what the rounding mode is during readback. You can read about the algorithm in the excellent paper by Burger and Dybvig.
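That paper is about producing shortest round-tripping output directly. As a sketch of the round-trip criterion only (not the Burger and Dybvig algorithm, and checking read-back only under the default rounding mode rather than all modes), you can brute-force it with the standard library:

```c
#include <stdio.h>
#include <stdlib.h>

/* Find the fewest significant digits that still read back as exactly x. */
static void shortest(double x, char *out, size_t cap) {
    for (int prec = 1; prec <= 17; ++prec) {  /* 17 digits always suffice */
        snprintf(out, cap, "%.*g", prec, x);
        if (strtod(out, NULL) == x) return;   /* round-trips: done */
    }
}

int main(void) {
    char buf[32];
    shortest(0.1, buf, sizeof buf);
    printf("%s\n", buf);   /* prints "0.1", not 0.1000000000000000055... */
    return 0;
}
```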
There are 3 ways of printing numbers:

1. printing numbers in bin or hex. This is the most precise way. I prefer hex because it reads more like base 10, for example F.8h = 15.5. There is no precision loss here.
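For illustration, standard C can already print this style exactly via the %a conversion; the exact digit layout is implementation-defined, e.g. glibc prints 0x1.fp+3 rather than 0xf.8p+0 for 15.5:

```c
#include <stdio.h>

int main(void) {
    /* %a converts the bits exactly: 16 is a power of 2, so no rounding */
    printf("%a\n", 15.5);   /* e.g. 0x1.fp+3 */
    printf("%a\n", 0.1);    /* e.g. 0x1.999999999999ap-4: 0.1 is inexact */
    return 0;
}
```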
2. printing in dec, but only the relevant digits. By this I mean only the digits which can have a 1 in your number when it is represented as closely as possible. The number of integer digits is easy and precise (no precision loss): n10 = ceil(n02 * log10(2)). The number of fractional digits is more tricky, and empirically I found this: if you make a dependency table n02 <-> n10, then you see that the constant 0.30102999566398119521373889472449 (log10(2)) is still present, but only starting from 8 bits, because fewer bits cannot represent 0.1 with satisfactory precision (0.85 - 1.15). Because of the negative exponents of base 2, the dependency is not linear; instead it follows a pattern. That empirical fit works for small bit counts (<= 52), but at larger bit counts there can be error, because the pattern used does not fit log10(2) or 1/log2(10) exactly. For larger bit counts I use a different formula, but that formula is 32-bit aligned !!! and a bigger bit count adds error to it as well.
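As a rough sketch of the digit-count estimates, here is my own simplification using plain log10(2); it does not reproduce the empirical pattern or the 32-bit-aligned formula mentioned above, so treat the fractional count as an approximation:

```c
#include <math.h>
#include <stdio.h>

/* log10(2): how many decimal digits each binary digit is worth */
#define LOG10_2 0.30102999566398119521373889472449

static int dec_digits_int(int int_bits)   { return (int)ceil(int_bits * LOG10_2); }
static int dec_digits_frac(int frac_bits) { return (int)floor(frac_bits * LOG10_2); }

int main(void) {
    /* 52 fractional bits (a double's mantissa) -> ~15 relevant digits */
    printf("52 fractional bits -> ~%d relevant decimal digits\n", dec_digits_frac(52));
    /* 11 integer bits -> at most 4 decimal digits, no precision loss */
    printf("11 integer bits    -> %d decimal digits\n", dec_digits_int(11));
    return 0;
}
```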
P.S. Further analysis of the binary representations of decimal numbers may reveal the exact pattern repetition, which would lead to the exact number of relevant digits for any bit count.
And at last, do not forget to round the cut-off digits !!! That means: if the digit after the last relevant digit is >= 5 in dec, then the last relevant digit should be +1... and if it is already 9, then you must go to the previous digit, and so on. A carry-propagating helper for this is sketched below.
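A small sketch of that carry-propagating round; the helper is my own, and it assumes the digits are held in an ASCII string:

```c
#include <string.h>

/* Round a decimal digit string in place so that only `keep` digits remain.
   Implements round-half-up with carry propagation as described above.
   Returns 1 if the carry ran off the left end (caller must prepend "1"). */
static int round_digits(char *digits, int keep) {
    if ((int)strlen(digits) <= keep) return 0;   /* nothing to cut off */
    int carry = digits[keep] >= '5';
    digits[keep] = '\0';
    for (int i = keep - 1; carry && i >= 0; --i) {
        if (digits[i] == '9') digits[i] = '0';   /* 9 + 1: carry continues */
        else { digits[i]++; carry = 0; }         /* carry absorbed */
    }
    return carry;
}
```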
3. print exact value. To print the exact value of a fractional binary number, just print n fractional decimal digits, where n is the number of fractional bits, because the represented value is a sum of bit values, so the number of fractional decimals cannot be bigger than the number of fractional decimal digits of the LSB. The stuff above (bullet #2) is relevant for storing a decimal number to a float (or for printing just the relevant decimals). The negative powers of two have exact decimal values (2^-1 = 0.5, 2^-2 = 0.25, 2^-3 = 0.125, 2^-4 = 0.0625, ...), and 2^-n needs exactly n fractional decimal digits.
The same goes for the negative powers of 10 stored in 64-bit doubles: printed in the exact-value style the stored representation error becomes visible, while printed in the relevant-decimal-digits-only style (which I am more used to) they come out as plain 0.1, 0.01, and so on. A sketch that reproduces the exact-value style follows. Hope it helps :)
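A minimal way to reproduce the exact-value style, assuming a C library (such as glibc) whose printf performs exact decimal conversion when asked for enough digits; a double's exact expansion needs at most 1074 fractional digits, and some older runtimes pad with zeros after 17 significant digits instead of converting exactly:

```c
#include <stdio.h>

int main(void) {
    /* Each literal is the double nearest the true power of ten; printing
       with many digits reveals the exact stored value. 70 digits is enough
       for these four (1e-4 needs 66 exact fractional digits). */
    const double powers[] = { 1e-1, 1e-2, 1e-3, 1e-4 };
    for (int i = 0; i < 4; ++i)
        printf("1e-%d -> %.70f\n", i + 1, powers[i]);
    return 0;
}
```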
You don't. The closest you can come to that is dumping the bytes.
If you want more exact results, why not use fixed point math instead? Conversions are quick, and the error is known and can be worked around. It's not an exact answer to your question, but a different idea for you; a tiny illustration follows.
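For illustration only, a hypothetical Q32.32 sketch (the format and the from_double helper are my own assumptions, not from this answer): converting a double is exact whenever its value fits in 32 integer and 32 fractional bits, and peeling decimal digits off the fraction is a finite, lossless operation.

```c
#include <stdint.h>
#include <stdio.h>

/* Q32.32 fixed point: the low 32 bits hold the fraction. */
typedef int64_t q32_32;

static q32_32 from_double(double d) { return (q32_32)(d * 4294967296.0); }

int main(void) {
    q32_32 x = from_double(2.75);                /* 2.75 = 10.11 in binary */
    printf("integer part: %lld\n", (long long)(x >> 32));

    /* Print fraction/2^32 in decimal: multiply by 10 and peel one digit
       per step; this terminates and loses nothing. */
    uint64_t frac = (uint64_t)(x & 0xFFFFFFFFu);
    printf("fractional digits: ");
    while (frac) {
        frac *= 10;
        printf("%u", (unsigned)(frac >> 32));    /* next decimal digit */
        frac &= 0xFFFFFFFFu;
    }
    printf("\n");                                /* prints "75" for 2.75 */
    return 0;
}
```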
Off the top of my head, why not break the number down into a sum of exact powers of two first? Then all your operations are loss-less. I.e., with mantissa bits b_i and exponent e, the value is the sum of 2^(e+i) over the set bits, and each such power of two has an exact, finite decimal expansion. Then sum those expansions (a sketch follows).
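A minimal sketch of that summing idea for a purely fractional value, with decimal digit arrays standing in for the big integer class; the bit-string input is my own framing of the approach:

```c
#include <stdio.h>

#define NDIGITS 80   /* 2^-n needs exactly n fractional decimal digits */

/* d /= 2, operating on NDIGITS fraction digits (value is 0.d0 d1 d2 ...) */
static void half(unsigned char *d) {
    int carry = 0;
    for (int i = 0; i < NDIGITS; ++i) {
        int cur = carry * 10 + d[i];
        d[i] = (unsigned char)(cur / 2);
        carry = cur % 2;
    }
}

/* acc += src; sums of distinct negative powers of two stay below 1.0 */
static void add(unsigned char *acc, const unsigned char *src) {
    int carry = 0;
    for (int i = NDIGITS - 1; i >= 0; --i) {
        int cur = acc[i] + src[i] + carry;
        acc[i] = (unsigned char)(cur % 10);
        carry = cur / 10;
    }
}

int main(void) {
    /* Example: 0.101101 binary = 2^-1 + 2^-3 + 2^-4 + 2^-6 = 0.703125 */
    const char *bits = "101101";
    unsigned char pow2[NDIGITS] = {5}, acc[NDIGITS] = {0};  /* pow2 = 0.5 */
    for (const char *b = bits; *b; ++b) {
        if (*b == '1') add(acc, pow2);   /* sum the exact decimal expansion */
        half(pow2);                      /* next negative power of two */
    }
    printf("0.");
    for (int i = 0; i < NDIGITS; ++i) putchar('0' + acc[i]);
    printf("\n");                        /* 0.70312500...0 */
    return 0;
}
```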
I'm thinking that breaking it down would be O(n) in the number of digits, the shifting is O(1), and the summing is O(n) digits...
You would have to have an integer class big enough to store the results, of course...
Let me know - I'm curious about this, it really made me think. :-)
Well, being no floating-point expert myself, I'd defer to a well-tested open source library.
GNU MPFR is a good one.
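For example, a minimal sketch of exact printing with MPFR, assuming it is installed (link with -lmpfr -lgmp); at 53 bits of precision, mpfr_set_d copies the double's value exactly:

```c
#include <stdio.h>
#include <mpfr.h>

int main(void) {
    mpfr_t x;
    mpfr_init2(x, 53);              /* 53 bits: same precision as a double */
    mpfr_set_d(x, 0.1, MPFR_RNDN);  /* exact copy of the stored double */
    mpfr_printf("%.60Rf\n", x);     /* full exact decimal expansion of 0.1 */
    mpfr_clear(x);
    return 0;
}
```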