In a codebase of ours I found this snippet for fast, towards-negative-infinity [1] rounding on x87:
inline int my_int(double x)
{
    int r;
#ifdef _GCC_
    asm ("fldl %1\n"
         "fistpl %0\n"
         :"=m"(r)
         :"m"(x));
#else
    // ...
#endif
    return r;
}
I'm not extremely familiar with GCC extended assembly syntax, but from what I gather from the documentation:
- r must be a memory location, where I'm writing back stuff;
- x must be a memory location too, whence the data comes;
- there's no clobber specification, so the compiler can rest assured that at the end of the snippet the registers are as it left them.
Now, to come to my question: it's true that in the end the FPU stack is balanced, but what if all 8 locations were already in use and I'm overflowing it? How can the compiler know that it cannot trust ST(7) to be where it left it? Should some clobber be added?
Edit: I tried to specify st(7) in the clobber list and it seems to affect the codegen; now I'll wait for some confirmation of this fact.
As a side note: looking at the implementation of the barebones lrint both in glibc and in MinGW, I see something like
__asm__ __volatile__ ("fistpl %0"
: "=m" (retval)
: "t" (x)
: "st");
where we are asking for the input to be placed directly in ST(0) (which avoids that potentially useless fldl); what is that "st" clobber? The docs seem to mention only t (i.e. the top of the stack).
[1] Yes, it depends on the current rounding mode, which in our application should always be "towards negative infinity".
The glibc/MinGW snippet you quoted is actually the correct way to represent the code you want as inline assembly.
To get the best possible code generated, you want to make use of the input and output operands. Rather than hard-coding the necessary load/store instructions, let the compiler generate them. Not only does this make it possible to elide unnecessary instructions, it also means that the compiler can better schedule these instructions when they are required (that is, it can interleave them with the preceding code, often minimizing their cost).
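As a minimal sketch of what that looks like for the function in the question (assuming GCC targeting x86; the name round_to_int_x87 and the lrint fallback for other compilers are my own additions, not part of the original code):

```c
#include <math.h>

/* Hypothetical name. Rounds x to int using the current rounding mode. */
static inline int round_to_int_x87(double x)
{
#if defined(__GNUC__) && !defined(__clang__) && \
    (defined(__i386__) || defined(__x86_64__))
    int r;
    /* "t" asks the compiler to place x in st(0) itself, so no explicit
       fldl is needed. fistpl stores the rounded value to r and pops the
       stack, which is why "st" is listed as clobbered. */
    __asm__ __volatile__ ("fistpl %0"
                          : "=m" (r)
                          : "t" (x)
                          : "st");
    return r;
#else
    /* Assumption: on other compilers/targets, fall back to the library. */
    return (int)lrint(x);
#endif
}
```

With the default round-to-nearest mode this rounds halfway cases to even; it only rounds towards negative infinity if the rounding mode has been set that way beforehand, as in your application.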
The "st" clobber refers to the st(0) register, i.e., the top of the x87 FPU stack. What Intel/MASM notation calls st(0), AT&T/GAS notation generally refers to as simply st. And, as per GCC's documentation for clobbers, the items in the clobber list are "either register names or the special clobbers" ("cc" (condition codes/flags) and "memory"). So this just means that the inline assembly clobbers (overwrites) the st(0) register. The reason why this clobber is necessary is that the fistpl instruction pops the top of the stack, thus clobbering the original contents of st(0).
The only thing that concerns me regarding this code is the passage in the documentation stating that clobber descriptions may not overlap with an input or output operand.
As you already know, the t constraint means the top of the x87 FPU stack. The problem is, this is the same as the st register, and the documentation very clearly says that we cannot have a clobber that specifies the same register as one of the input/output operands. Furthermore, since the documentation states that the compiler is forbidden to use any of the clobbered registers to represent input/output operands, this inline assembly makes an impossible request: load this value at the top of the x87 FPU stack without putting it in st!
Now, I would assume that the authors of glibc know what they are doing and are more familiar with the compiler's implementation of inline assembly than you or I, so this code is probably legal and legitimate.
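Actually, it seems that the unusual case of the x87's stack-like registers forces an exception to the normal interactions between clobbers and operands. The official documentation says that an input register that is implicitly popped by the asm must be explicitly clobbered, unless it is constrained to match an output operand.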
That fits our case exactly.
Further confirmation is provided by an example appearing in the official documentation (bottom of the linked section):
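As far as I recall, the example in question is the fyl2xp1 one. Here is a sketch of it wrapped in a function so it can be compiled; the wrapper name y_log2_xp1, the use of double operands, and the log2 fallback for non-GCC or non-x86 builds are my assumptions, not part of the documentation:

```c
#include <math.h>

/* Hypothetical wrapper: computes y * log2(x + 1) for small |x|. */
static inline double y_log2_xp1(double x, double y)
{
#if defined(__GNUC__) && !defined(__clang__) && \
    (defined(__i386__) || defined(__x86_64__))
    double result;
    /* fyl2xp1 reads st(0) = x and st(1) = y, pops the stack once, and
       leaves the result in the new st(0). x is tied to the output with
       "0"; y is an input that gets popped, so st(1) is clobbered. */
    __asm__ ("fyl2xp1"
             : "=t" (result)
             : "0" (x), "u" (y)
             : "st(1)");
    return result;
#else
    /* Assumption: portable fallback with the same mathematical result. */
    return y * log2(x + 1.0);
#endif
}
```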
Here, the clobber st(1) is the same as the input constraint u, which seems to violate the documentation's rule about clobbers cited above, but is used and justified for precisely the same reason that "st" is used as the clobber in your original code: fistpl pops the input.
All of that said, and now that you know how to correctly write the code in inline assembly, I have to echo previous commenters who suggested that the best solution would be not to use inline assembly at all. Just call lrint, which not only has the exact semantics that you want, but can also be better optimized by the compiler under certain circumstances (e.g., transforming it into a single cvtsd2si instruction when the target architecture supports SSE2).