In a codebase of ours I found this snippet for fast, towards-negative-infinity [1] rounding on x87:
inline int my_int(double x)
{
int r;
#ifdef _GCC_
asm ("fldl %1\n"
"fistpl %0\n"
:"=m"(r)
:"m"(x));
#else
// ...
#endif
return r;
}
I'm not extremely familiar with GCC extended assembly syntax, but from what I gather from the documentation:
- r must be a memory location, where I'm writing back stuff;
- x must be a memory location too, from which the data comes;
- there's no clobber specification, so the compiler can rest assured that at the end of the snippet the registers are as it left them.
Now, to come to my question: it's true that in the end the FPU stack is balanced, but what if all 8 locations were already in use and I'm overflowing it? How can the compiler know that it cannot trust ST(7) to be where it left it? Should some clobber be added?
Edit: I tried to specify st(7) in the clobber list and it seems to affect the codegen; now I'll wait for some confirmation of this fact.
As a side note: looking at the implementation of the barebones lrint both in glibc and in MinGW I see something like
__asm__ __volatile__ ("fistpl %0"
: "=m" (retval)
: "t" (x)
: "st");
where we are asking for the input to be placed directly in ST(0) (which avoids that potentially useless fldl); what is that "st" clobber? The docs seem to mention only t (i.e. the top of the stack).
[1] Yes, it depends on the current rounding mode, which in our application should always be "towards negative infinity".
looking at the implementation of the barebones lrint both in glibc and in MinGW I see something like
__asm__ __volatile__ ("fistpl %0"
: "=m" (retval)
: "t" (x)
: "st");
where we are asking for the input to be placed directly in ST(0) (which avoids that potentially useless fldl)
This is actually the correct way to represent the code you want as inline assembly.
To get the best possible code generated, you want to make use of the inputs and outputs. Rather than hard-coding the necessary load/store instructions, let the compiler generate them. Not only does this introduce the possibility of eliding potentially unnecessary instructions, it also means that the compiler can better schedule these instructions when they are required (that is, it can interleave the instruction within a prior sequence of code, often minimizing its cost).
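As a rough sketch of what the whole function could look like when written this way (the name my_int_t is made up for illustration, and the lrint fallback for non-x86 targets is my own assumption, not something from your codebase):

```c
#include <math.h>   /* lrint, used only by the non-x87 fallback */

/* Hypothetical helper: converts x to int using the current rounding mode. */
static inline int my_int_t(double x)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    int r;
    /* "t" asks the compiler to place x in st(0) itself, so no explicit
       fldl is needed. fistpl pops st(0), hence the "st" clobber. */
    __asm__ __volatile__ ("fistpl %0"
                          : "=m" (r)
                          : "t" (x)
                          : "st");
    return r;
#else
    /* Assumed fallback for targets without x87; also honors the
       current rounding mode. */
    return (int)lrint(x);
#endif
}
```

With the default round-to-nearest mode this should give 2 for 2.5 and 4 for 3.5 (ties go to even); under a round-down mode it behaves like floor.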
what is that "st" clobber? The docs seem to mention only t (i.e. the top of the stack).
The "st" clobber refers to the st(0) register, i.e., the top of the x87 FPU stack. What Intel/MASM notation calls st(0), AT&T/GAS notation generally refers to as simply st. And, as per GCC's documentation for clobbers, the items in the clobber list are "either register names or the special clobbers" ("cc" (condition codes/flags) and "memory"). So this just means that the inline assembly clobbers (overwrites) the st(0) register. The reason why this clobber is necessary is that the fistpl instruction pops the top of the stack, thus clobbering the original contents of st(0).
The only thing that concerns me regarding this code is the following paragraph from the documentation:
Clobber descriptions may not in any way overlap with an input or output operand. For example, you may not have an operand describing a register class with one member when listing that register in the clobber list. Variables declared to live in specific registers (see Explicit Register Variables) and used as asm input or output operands must have no part mentioned in the clobber description. In particular, there is no way to specify that input operands get modified without also specifying them as output operands.
When the compiler selects which registers to use to represent input and output operands, it does not use any of the clobbered registers. As a result, clobbered registers are available for any use in the assembler code.
As you already know, the t constraint means the top of the x87 FPU stack. The problem is, this is the same as the st register, and the documentation very clearly says that we cannot have a clobber that specifies the same register as one of the input/output operands. Furthermore, since the documentation states that the compiler is forbidden to use any of the clobbered registers to represent input/output operands, this inline assembly makes an impossible request: load this value at the top of the x87 FPU stack without putting it in st!
Now, I would assume that the authors of glibc know what they are doing and are more familiar with the compiler's implementation of inline assembly than you or I, so this code is probably legal and legitimate.
Actually, it seems that the unusual case of the x87's stack-like registers forces an exception to the normal interactions between clobbers and operands. The official documentation says:
On x86 targets, there are several rules on the usage of stack-like registers in the operands of an asm. These rules apply only to the operands that are stack-like registers:
Given a set of input registers that die in an asm, it is necessary to know which are implicitly popped by the asm, and which must be explicitly popped by GCC.
An input register that is implicitly popped by the asm must be explicitly clobbered, unless it is constrained to match an output operand.
That fits our case exactly.
Further confirmation is provided by an example appearing in the official documentation (bottom of the linked section):
This asm takes two inputs, which are popped by the fyl2xp1 opcode, and replaces them with one output. The st(1) clobber is necessary for the compiler to know that fyl2xp1 pops both inputs.
asm ("fyl2xp1" : "=t" (result) : "0" (x), "u" (y) : "st(1)");
Here, the clobber st(1) is the same as the input constraint u, which seems to violate the above-quoted documentation regarding clobbers, but is used and justified for precisely the same reason that "st" is used as the clobber in your original code: because fistpl pops the input.
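To make the documentation's one-liner concrete, here is a minimal sketch that wraps it in a function. The wrapper name y_log2_1p, the double operand types, and the portable fallback are my own assumptions; only the asm statement itself comes from the GCC docs.

```c
#include <math.h>   /* log1p/log, used only by the non-x87 fallback */

/* Hypothetical wrapper: computes y * log2(x + 1). fyl2xp1 is only
   specified for small |x| (roughly |x| < 1 - sqrt(2)/2). */
static inline double y_log2_1p(double x, double y)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    double result;
    /* x goes in st(0) (matching output "0"), y in st(1) ("u").
       The instruction pops both and pushes one result, so st(1)
       is listed as a clobber to tell the compiler it was popped. */
    __asm__ ("fyl2xp1" : "=t" (result) : "0" (x), "u" (y) : "st(1)");
    return result;
#else
    /* Assumed fallback for targets without x87. */
    return y * log1p(x) / log(2.0);
#endif
}
```

For instance, y_log2_1p(0.25, 1.0) should come out close to log2(1.25), about 0.3219.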
All of that said, and now that you know how to correctly write the code in inline assembly, I have to echo previous commenters who suggested that the best solution would be not to use inline assembly at all. Just call lrint, which not only has the exact semantics that you want, but can also be better optimized by the compiler under certain circumstances (e.g., transforming it into a single cvtsd2si instruction when the target architecture supports SSE2).