可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

I'm working on porting the sqrt function (for 64-bit doubles) from fdlibm to a model-checker tool I'm using at the moment (cbmc).
As part of my doings, I read a lot about the ieee-754 standard, but I think I didn't understand the guarantees of precision for the basic operations (incl. sqrt).

Testing my port of fdlibm's sqrt, I got the following calculation with sqrt on a 64-bit double:

sqrt(1977061516825203605555216616167125005658976571589721139027150498657494589171970335387417823661417383745964289845929120708819092392090053015474001800648403714048.0) = 44464159913633855548904943164666890000299422761159637702558734139742800916250624.0

(this case broke a simple post-condition in my test regarding precision; I'm not sure anymore if this post-condition is possible with IEEE-754)

For a comparison, several multi-precision tools calculated something like:

sqrt(1977061516825203605555216616167125005658976571589721139027150498657494589171970335387417823661417383745964289845929120708819092392090053015474001800648403714048.0) =44464159913633852501611468455197640079591886932526256694498106717014555047373210.truncated

One can see, that the 17-th number from the left is different, meaning an error like:

3047293474709469249920707535828633381008060627422728245868877413.0

Question 1: Is this huge amount of error allowed?

The standard is saying that every basic operation (+,-,*,/,sqrt) should be within 0.5 ulps, meaning that it should be equal to a mathematically exact result rounded to the nearest fp-representation (wiki is saying that some libraries only guarantees 1 ulp, but that isn't that important at the moment).

Question 2: Does that mean, that every basic operation should have an error < 2.220446e-16 with 64-bit doubles (machine-epsilon)?

I did calculate the same with a x86-32 linux system (glibc / eglibc) and got the same result like that obtained with fdlibm, which let me think that:

a: I did something wrong (but how: printf would be a candidate, but I don't know if that could be the reason)
b: the error/precision is common in these libraries

回答1:

The IEEE-754 standard requires that so called "basic operations" (which include addition, multiplication, division and square root) are correctly rounded. This means that there is a unique allowed answer, and it is the closest representable floating-point number to the so-called "infinitely precise" result of the operation.

In double-precision, numbers have 53 binary digits of precision, so the correct answer is the exact answer rounded to 53 significant digits. As Rick Regan showed in his answer, this is exactly the result that you got.

The answers to your questions are:

Question 1: Is this huge amount of error allowed?

Yes, but it is quite misleading to call this error "huge". The fact is that there is no double-precision value that could be returned that would have a smaller error.

Question 2: Does that mean, that every basic operation should have an error < 2.220446e-16 with 64-bit doubles (machine-epsilon)?

No. It means that every basic operation should be rounded to the (unique) closest representable floating-point number according to the current rounding mode. This is not quite the same as saying that the relative error is bounded by machine epsilon.

Question 3: Which result do you obtain with your x86 hardware and gcc + libc?

The same answer you did, because sqrt is correctly rounded on any reasonable platform.

回答2:

In binary, the first 58 bits of the arbitrary precision answer is 1011111111111111111111110101010101111111111111111011010001...

The 53-bit significand of the double value is

10111111111111111111111101010101011111111111111110111

Which means that the double value is correctly rounded to 53 significant bits, and is within 1/2 ULP. (That the error is "large" is only because the number itself is large).