Question:
Will modern (2008/2010) incarnations of Visual Studio or Visual C++ Express produce x86 MUL instructions (unsigned multiply) in the compiled code? I cannot seem to find or contrive an example where they appear in compiled code, even when using unsigned types.
If VS does not compile using MUL, is there a rationale why?
Answer 1:
`imul` (signed) and `mul` (unsigned) both have a one-operand form that does `edx:eax = eax * src`, i.e. a 32x32b => 64b full multiply (or 64x64b => 128b). The 286 added an `imul dest(reg), src(reg/mem), immediate` form, and the 386 added an `imul r32, r/m32` form, both of which compute only the lower half of the result. (Link from the x86 tag wiki.)

When multiplying two 32-bit values, the least significant 32 bits of the result are the same whether you consider the values to be signed or unsigned. In other words, the difference between a signed and an unsigned multiply becomes apparent only if you look at the "upper" half of the result, which one-operand `imul`/`mul` puts in `edx` and two- or three-operand `imul` puts nowhere. Thus, the multi-operand forms of `imul` can be used on signed and unsigned values alike, and there was no need for Intel to add new forms of `mul` as well. (They could have made multi-operand `mul` a synonym for `imul`, but that would make disassembly output not match the source.)
In C, results of arithmetic operations have the same type as the operands (after integer promotion for narrow integer types). If you multiply two `int`s together, you get an `int`, not a `long long`: the "upper half" is not retained. Hence, the C compiler only needs what `imul` provides, and since `imul` is easier to use than `mul`, the C compiler uses `imul` to avoid needing `mov` instructions to get data into / out of `eax`.
As a second step, since C compilers use the multi-operand form of `imul` a lot, Intel and AMD invest effort into making it as fast as possible. It writes only one output register, not `e/rdx:e/rax`, so it was possible for CPUs to optimize it more easily than the one-operand form. This makes `imul` even more attractive.
The one-operand form of `mul`/`imul` is useful when implementing big-number arithmetic. In C, in 32-bit mode, you should get some `mul` invocations by multiplying `unsigned long long` values together. But, depending on the compiler and OS, those `mul` opcodes may be hidden in a dedicated function, so you will not necessarily see them. In 64-bit mode, `long long` has only 64 bits, not 128, and the compiler will simply use `imul`.
Answer 2:
There are three different types of multiply instructions on x86. The first is `MUL reg`, which does an unsigned multiply of `EAX` by reg and puts the (64-bit) result into `EDX:EAX`. The second is `IMUL reg`, which does the same with a signed multiply. The third type is either `IMUL reg1, reg2` (multiplies reg1 with reg2 and stores the 32-bit result into reg1) or `IMUL reg1, reg2, imm` (multiplies reg2 by imm and stores the 32-bit result into reg1).

Since in C, multiplies of two 32-bit values produce 32-bit results, compilers normally use the third type (signedness doesn't matter, since the low 32 bits agree between signed and unsigned 32x32 multiplies). VC++ will generate the "long multiply" versions of `MUL`/`IMUL` if you actually use the full 64-bit result, e.g. here:
```c
unsigned long long prod(unsigned int a, unsigned int b)
{
    return (unsigned long long) a * b;
}
```
The 2-operand (and 3-operand) versions of `IMUL` are faster than the one-operand versions simply because they don't produce a full 64-bit result. Wide multipliers are large and slow; it's much easier to build a smaller multiplier and synthesize long multiplies using microcode if necessary. Also, `MUL`/`IMUL` writes two registers, which again is usually resolved by breaking it into multiple instructions internally; it's much easier for the instruction-reordering hardware to keep track of two dependent instructions that each write one register (most x86 instructions look like that internally) than to keep track of one instruction that writes two.
Answer 3:
According to http://gmplib.org/~tege/x86-timing.pdf, the `IMUL` instruction has a lower latency and higher throughput (if I'm reading the table correctly). Perhaps VS is simply using the faster instruction (assuming that `IMUL` and `MUL` always produce the same output).

I don't have Visual Studio handy, so I tried with GCC instead. I also always get some variation of `IMUL`.
This:
```c
unsigned int func(unsigned int a, unsigned int b)
{
    return a * b;
}
```
Assembles to this (with -O2):
```asm
_func:
LFB2:
        pushq   %rbp
LCFI0:
        movq    %rsp, %rbp
LCFI1:
        movl    %esi, %eax
        imull   %edi, %eax
        movzbl  %al, %eax
        leave
        ret
```
Answer 4:
My intuition tells me that the compiler chose `IMUL` arbitrarily (or whichever of the two was faster), since the bits will be the same whether it uses unsigned `MUL` or signed `IMUL`. Any one-operand 32-bit multiply produces a 64-bit result spanning two registers, `EDX:EAX`. The overflow goes into `EDX`, which is essentially ignored since we only care about the 32-bit result in `EAX`. Using `IMUL` will sign-extend into `EDX` as necessary, but again, we don't care since we're only interested in the 32-bit result.
Answer 5:
Right after I looked at this question, I found MULQ in my generated code when dividing.
The full code turns a large binary number into chunks of a billion so that it can easily be converted to a string.
C++ code:
```cpp
for_each(TempVec.rbegin(), TempVec.rend(), [&](Short &Num) {
    Remainder <<= 32;
    Remainder += Num;
    Num = Remainder / 1000000000;
    Remainder %= 1000000000; // equivalent to Remainder %= DecimalConvert
});
```
Optimized generated assembly:
```asm
00007FF7715B18E8  lea   r9,[rsi-4]
00007FF7715B18EC  mov   r13,12E0BE826D694B2Fh
00007FF7715B18F6  nop   word ptr [rax+rax]
00007FF7715B1900  shl   r8,20h
00007FF7715B1904  mov   eax,dword ptr [r9]
00007FF7715B1907  add   r8,rax
00007FF7715B190A  mov   rax,r13
00007FF7715B190D  mul   rax,r8
00007FF7715B1910  mov   rcx,r8
00007FF7715B1913  sub   rcx,rdx
00007FF7715B1916  shr   rcx,1
00007FF7715B1919  add   rcx,rdx
00007FF7715B191C  shr   rcx,1Dh
00007FF7715B1920  imul  rax,rcx,3B9ACA00h
00007FF7715B1927  sub   r8,rax
00007FF7715B192A  mov   dword ptr [r9],ecx
00007FF7715B192D  lea   r9,[r9-4]
00007FF7715B1931  lea   rax,[r9+4]
00007FF7715B1935  cmp   rax,r14
00007FF7715B1938  jne   NumToString+0D0h (07FF7715B1900h)
```
Notice the `mul` instruction 5 lines down.
This generated code is extremely unintuitive, I know; in fact it looks nothing like the compiled code, but `div` is extremely slow (~25 cycles for a 32-bit div, and ~75 for a 64-bit div, according to this chart) compared with `mul` or `imul` (around 3 or 4 cycles) on modern PCs, so it makes sense to try to get rid of `div` even if you have to add all sorts of extra instructions.
I don't fully understand the optimization here, but if you would like a rationale and a mathematical explanation of using compile-time multiplication to divide by constants, see this paper.
This is an example of the compiler making use of the performance and capability of the full 64 by 64-bit untruncated multiply without showing the C++ coder any sign of it.
Answer 6:
As already explained, C/C++ does not do word*word to double-word operations, which is what the `mul` instruction is best for. But there are cases where you want word*word to double-word, so you need an extension to C/C++.

GCC, Clang, and ICC provide a builtin type `__int128`, which you can use to indirectly get the `mul` instruction.
MSVC provides the `_umul128` intrinsic (since at least VS 2010), which generates the `mul` instruction. With this intrinsic, along with the `_addcarry_u64` intrinsic, you could build your own efficient `__int128` type with MSVC.