As far as I can tell, the only difference between __asm { ... }; and __asm__("..."); is that the first uses mov eax, var and the second uses movl %0, %%eax with :"=r" (var) at the end. What other differences are there? And what about just asm?

asm vs __asm__ in GCC

asm does not work with -std=c99, so you have two alternatives:

- use __asm__
- use -std=gnu99

More details: error: ‘asm’ undeclared (first use in this function)
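
As a minimal sketch of the first alternative: the function below uses the __asm__ spelling, which GCC accepts under both -std=c99 and -std=gnu99. The name add_one and the empty asm template are made up for illustration and are not part of the answer.

```c
/* Hypothetical example: add_one is an illustrative name only. */
int add_one(int x)
{
    /* Empty template: emits no instructions on any target. The "+r"
     * operand tells the compiler that x is read and possibly modified
     * in a register, so it cannot constant-fold through this point. */
    __asm__ volatile("" : "+r"(x));
    return x + 1;
}
```

Spelling the keyword as plain asm here is expected to fail under gcc -std=c99, while -std=gnu99 accepts both spellings.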

__asm vs __asm__ in GCC

I could not find where __asm is documented (notably, it is not mentioned at https://gcc.gnu.org/onlinedocs/gcc-7.2.0/gcc/Alternate-Keywords.html#Alternate-Keywords ), but in the GCC 8.1 source they are exactly the same, so I would just use __asm__, which is documented.

There's a massive difference between MSVC inline asm and GNU C inline asm. GCC syntax is designed for optimal output without wasted instructions, for example when wrapping a single instruction. MSVC syntax is designed to be fairly simple, but AFAICT it's impossible to use without the latency and extra instructions of a round trip through memory for your inputs and outputs.

If you're using inline asm for performance reasons, this makes MSVC inline asm viable only if you write a whole loop entirely in asm, not for wrapping short sequences in an inline function. The example below (wrapping idiv with a function) is the kind of thing MSVC is bad at: ~8 extra store/load instructions.

MSVC inline asm (used by MSVC and probably icc, maybe also available in some commercial compilers):

- Data gets in and out through memory: values are stored so that your asm can load them from named variables, with mov ecx, shift_count for example. So using a single asm instruction that the compiler won't generate for you involves a round-trip through memory on the way in and on the way out.

GNU C inline asm is not a good way to learn asm. You have to understand asm very well so you can tell the compiler about your code. And you have to understand what compilers need to know. That answer also has links to other inline-asm guides and Q&As. The x86 tag wiki has lots of good stuff for asm in general, but just links to that for GNU inline asm. (The stuff in that answer is applicable to GNU inline asm on non-x86 platforms, too.)

GNU C inline asm syntax is used by gcc, clang, icc, and maybe some commercial compilers which implement GNU C:

- An input operand such as "c" (shift_count) will get the compiler to put the shift_count variable into ecx before your inline asm runs.
- Extra clunky for large blocks of code, because the asm has to be inside a string constant. So you typically need a separator such as \n\t at the end of each instruction line inside the string.
- Very unforgiving / harder, but allows lower overhead, especially for wrapping single instructions. (Wrapping single instructions was the original design intent, which is why you have to specially tell the compiler about early clobbers to stop it from using the same register for an input and output if that's a problem.)
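
The points in this list can be sketched in one small wrapper. This is an illustration under assumptions, not code from the answer: the name shift_left, the x86 preprocessor guard, and the portable fallback are all made up here. It shows a "c" input operand, a two-instruction template joined with \n\t, and an early-clobber output.

```c
#include <stdint.h>

/* Hypothetical wrapper: shift_left and the guard below are assumptions
 * made for this sketch. */
uint32_t shift_left(uint32_t value, uint32_t shift_count)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    uint32_t result;
    __asm__("movl %1, %0\n\t"      /* result = value */
            "shll %%cl, %0"        /* result <<= cl (hardware uses the low 5 bits) */
            : "=&r"(result)        /* early clobber: written before cl is read,
                                      so it must not share ecx with the count */
            : "r"(value),
              "c"(shift_count)     /* compiler places shift_count in ecx */
            : "cc");               /* shl modifies the flags */
    return result;
#else
    /* Portable fallback with the same count masking as the x86 instruction. */
    return value << (shift_count & 31);
#endif
}
```

Without the & in "=&r", the compiler would be free to pick ecx for result, and the first instruction would then overwrite the shift count before the second one reads it.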

Example: full-width integer division (div)

On a 32bit CPU, dividing a 64bit integer by a 32bit integer, or doing a full-multiply (32x32->64), can benefit from inline asm. gcc and clang don't take advantage of idiv for (int64_t)a / (int32_t)b, probably because the instruction faults if the result doesn't fit in a 32bit register. So unlike this Q&A about getting quotient and remainder from one div, this is a use-case for inline asm. (Unless there's a way to inform the compiler that the result will fit, so idiv won't fault.)

We'll use calling conventions that put some args in registers (with hi even in the right register), to show a situation that's closer to what you'd see when inlining a tiny function like this.

MSVC

Be careful with register-arg calling conventions when using inline-asm. Apparently the inline-asm support is so badly designed/implemented that the compiler might not save/restore arg registers around the inline asm, if those args aren't used in the inline asm. Thanks @RossRidge for pointing this out.

Update: apparently leaving a value in eax or edx:eax and then falling off the end of a non-void function (without a return) is supported, even when inlining. I assume this works only if there's no code after the asm statement. This avoids the store/reloads for the output (at least for quotient), but we can't do anything about the inputs. In a non-inline function with stack args, they will be in memory already, but in this use-case we're writing a tiny function that could usefully inline.

Compiled with MSVC 19.00.23026 /O2 on rextester (with a main() that finds the directory of the exe and dumps the compiler's asm output to stdout).

There's a ton of extra mov instructions, and the compiler doesn't even come close to optimizing any of it away. I thought maybe it would see and understand the mov tmp, edx inside the inline asm, and make that a store to premainder. But that would require loading premainder from the stack into a register before the inline asm block, I guess.

This function is actually worse with _vectorcall than with the normal everything-on-the-stack ABI. With two inputs in registers, it stores them to memory so the inline asm can load them from named variables. If this were inlined, even more of the parameters could potentially be in the regs, and it would have to store them all, so the asm would have memory operands! So unlike gcc, we don't gain much from inlining this.

Doing *premainder = tmp inside the asm block means more code written in asm, but does avoid the totally braindead store/load/store path for the remainder. This reduces the instruction count by 2 total, down to 11 (not including the ret).

I'm trying to get the best possible code out of MSVC, not "use it wrong" and create a straw-man argument. But AFAICT it's horrible for wrapping very short sequences. Presumably there's an intrinsic function for 64/32 -> 32 division that allows the compiler to generate good code for this particular case, so the entire premise of using inline asm for this on MSVC could be a straw-man argument. But it does show you that intrinsics are much better than inline asm for MSVC.
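
A minimal sketch of the kind of MSVC wrapper discussed above, assuming MSVC targeting 32-bit x86. The name div64_msvc, the preprocessor guard, and the portable C fallback are assumptions added for illustration; the calling-convention keyword is left out to keep the sketch small. The loads from hi, lo and divisor and the stores to quotient and tmp are the memory round trip described in the text.

```c
#include <stdint.h>

/* Hypothetical sketch: the __asm block is only compiled by MSVC for
 * 32-bit x86; every other compiler takes the plain C fallback. */
int div64_msvc(int hi, int lo, int divisor, int *premainder)
{
#if defined(_MSC_VER) && defined(_M_IX86)
    int quotient, tmp;
    __asm {
        mov  edx, hi          // inputs are loaded from named variables
        mov  eax, lo
        idiv divisor          // edx:eax / divisor: eax = quotient, edx = remainder
        mov  quotient, eax    // outputs are stored back to named variables
        mov  tmp, edx
    }
    *premainder = tmp;
    return quotient;
#else
    /* Same arithmetic in portable C: dividend is hi:lo as a signed 64-bit value. */
    int64_t dividend = (int64_t)hi * 4294967296LL + (uint32_t)lo;
    *premainder = (int)(dividend % divisor);
    return (int)(dividend / divisor);
#endif
}
```

As with idiv itself, the caller is assumed to pass a nonzero divisor and a dividend whose quotient fits in 32 bits.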
GNU C (gcc/clang/icc)
Gcc does even better than the output shown here when inlining div64, because it can typically arrange for the preceding code to generate the 64bit integer in edx:eax in the first place.
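
A sketch of what a GNU C version of div64 could look like, assuming an x86 target. The operand name divsrc, the preprocessor guard, and the portable fallback are assumptions made for this illustration. The "a" and "d" constraints ask for eax and edx directly, so no values need to pass through memory.

```c
#include <stdint.h>

/* Sketch under assumptions: on non-x86 targets the plain C fallback
 * computes the same result. */
int div64(int hi, int lo, int divisor, int *premainder)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    int quotient, rem;
    __asm__("idivl %[divsrc]"
            : "=a"(quotient), "=d"(rem)   /* outputs: eax = quotient, edx = remainder */
            : "d"(hi), "a"(lo),           /* inputs: edx:eax holds the dividend */
              [divsrc] "rm"(divisor)      /* divisor: register or memory, compiler picks */
            : "cc");                      /* idiv leaves the flags undefined */
    *premainder = rem;
    return quotient;
#else
    /* Same arithmetic in portable C: dividend is hi:lo as a signed 64-bit value. */
    int64_t dividend = (int64_t)hi * 4294967296LL + (uint32_t)lo;
    *premainder = (int)(dividend % divisor);
    return (int)(dividend / divisor);
#endif
}
```

The caller must guarantee a nonzero divisor and a quotient that fits in 32 bits, since idiv faults otherwise.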

I can't get gcc to compile for the 32bit vectorcall ABI. Clang can, but it sucks at inline asm with "rm" constraints (try it on the godbolt link: it bounces the function arg through memory instead of using the register option in the constraint). The 64bit MS calling convention is close to the 32bit vectorcall, with the first two params in edx, ecx. The difference is that 2 more params go in regs before using the stack (and that the callee doesn't pop the args off the stack, which is what the ret 8 was about in the MSVC output.)

Compiled with gcc -m64 -O3 -mabi=ms -fverbose-asm. With -m32 you just get 3 loads, idiv, and a store, as you can see from changing stuff in that godbolt link.

For 32bit vectorcall, gcc would do something like

MSVC uses 13 instructions (not including the ret), compared to gcc's 4. With inlining, as I said, it potentially compiles to just one, while MSVC would still use probably 9. (It won't need to reserve stack space or load premainder; I'm assuming it still has to store about 2 of the 3 inputs. Then it reloads them inside the asm, runs idiv, stores two outputs, and reloads them outside the asm. So that's 4 loads/stores for input, and another 4 for output.)

With the gcc compiler, there isn't a big difference: asm, __asm, and __asm__ are the same. The alternate spellings exist to avoid namespace conflicts (for example, a user-defined function named asm).

Which one you use depends on your compiler. This isn't standard like the C language.