x86 MUL Instruction from VS 2008/2010

2019-01-06 21:34发布

站内文章 / C++

90 0

冷血范

女 | 书童

私信

可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

Will modern (2008/2010) incantations of Visual Studio or Visual C++ Express produce x86 MUL instructions (unsigned multiply) in the compiled code? I cannot seem to find or contrive an example where they appear in compiled code, even when using unsigned types.

If VS does not compile using MUL, is there a rationale why?

回答1:

imul (signed) and mul (unsigned) both have a one-operand form that does edx:eax = eax * src. i.e. a 32x32b => 64b full multiply (or 64x64b => 128b).

286 added an imul dest(reg), src(reg/mem), immediate form, and 386 added an imul r32, r/m32 form, both of which which only compute the lower half of the result. (Link from the x86 tag wiki).

When multiplying two 32-bit values, the least significant 32 bits of the result are the same, whether you consider the values to be signed or unsigned. In other words, the difference between a signed and an unsigned multiply becomes apparent only if you look at the "upper" half of the result, which one-operand imul/mul puts in edx and two or three operand imul puts nowhere. Thus, the multi-operand forms of imul can be used on signed and unsigned values, and there was no need for Intel to add new forms of mul as well. (They could have made multi-operand mul a synonym for imul, but that would make disassembly output not match the source.)

In C, results of arithmetic operations have the same type as the operands (after integer promotion for narrow integer types). If you multiply two int together, you get an int, not a long long: the "upper half" is not retained. Hence, the C compiler only needs what imul provides, and since imul is easier to use than mul, the C compiler uses imul to avoid needing mov instructions to get data into / out of eax.

As a second step, since C compilers use the multiple-operand form of imul a lot, Intel and AMD invest effort into making it as fast as possible. It only writes one output register, not e/rdx:e/rax, so it was possible for CPUs to optimize it more easily than the one-operand form. This makes imul even more attractive.

The one-operand form of mul/imul is useful when implementing big number arithmetic. In C, in 32-bit mode, you should get some mul invocations by multiplying unsigned long long values together. But, depending on the compiler and OS, those mul opcodes may be hidden in some dedicated function, so you will not necessarily see them. In 64-bit mode, long long has only 64 bits, not 128, and the compiler will simply use imul.

回答2:

There's three different types of multiply instructions on x86. The first is MUL reg, which does an unsigned multiply of EAX by reg and puts the (64-bit) result into EDX:EAX. The second is IMUL reg, which does the same with a signed multiply. The third type is either IMUL reg1, reg2 (multiplies reg1 with reg2 and stores the 32-bit result into reg1) or IMUL reg1, reg2, imm (multiplies reg2 by imm and stores the 32-bit result into reg1).

Since in C, multiplies of two 32-bit values produce 32-bit results, compilers normally use the third type (signedness doesn't matter, the low 32 bits agree between signed and unsigned 32x32 multiplies). VC++ will generate the "long multiply" versions of MUL/IMUL if you actually use the full 64-bit results, e.g. here:

unsigned long long prod(unsigned int a, unsigned int b)
{
  return (unsigned long long) a * b;
}

The 2-operand (and 3-operand) versions of IMUL are faster than the one-operand versions simply because they don't produce a full 64-bit result. Wide multipliers are large and slow; it's much easier to build a smaller multiplier and synthesize long multiplies using Microcode if necessary. Also, MUL/IMUL writes two registers, which again is usually resolved by breaking it into multiple instructions internally - it's much easier for the instruction reordering hardware to keep track of two dependent instructions that each write one register (most x86 instructions look like that internally) than it is to keep track of one instruction that writes two.

回答3:

According to http://gmplib.org/~tege/x86-timing.pdf, the IMUL instruction has a lower latency and higher throughput (if I'm reading the table correctly). Perhaps VS is simply using the faster instruction (that's assuming that IMUL and MUL always produce the same output).

I don't have Visual Studio handy, so I tried to get something else with GCC. I also always get some variation of IMUL.

This:

unsigned int func(unsigned int a, unsigned int b)
{ 
    return a * b;
}

Assembles to this (with -O2):

_func:
LFB2:
        pushq   %rbp
LCFI0:
        movq    %rsp, %rbp
LCFI1:
        movl    %esi, %eax
        imull   %edi, %eax
        movzbl  %al, %eax
        leave
        ret

回答4:

My intuition tells me that the compiler chose IMUL arbitrarily (or whichever was faster of the two) since the bits will be the same whether it uses unsigned MUL or signed IMUL. Any 32-bit integer multiplication will be 64-bits spanning two registers, EDX:EAX. The overflow goes into EDX which is essentially ignored since we only care about the 32-bit result in EAX. Using IMUL will sign-extend into EDX as necessary but again, we don't care since we're only interested in the 32-bit result.

回答5:

Right after I looked at this question I found MULQ in my generated code when dividing.

The full code is turning a large binary number into chunks of a billion so that it can be easily converted to a string.

C++ code:

for_each(TempVec.rbegin(), TempVec.rend(), [&](Short & Num){
    Remainder <<= 32;
    Remainder += Num;
    Num = Remainder / 1000000000;
    Remainder %= 1000000000;//equivalent to Remainder %= DecimalConvert
});

Optimized Generated Assembly

00007FF7715B18E8  lea         r9,[rsi-4]  
00007FF7715B18EC  mov         r13,12E0BE826D694B2Fh  
00007FF7715B18F6  nop         word ptr [rax+rax] 
00007FF7715B1900  shl         r8,20h  
00007FF7715B1904  mov         eax,dword ptr [r9]  
00007FF7715B1907  add         r8,rax  
00007FF7715B190A  mov         rax,r13  
00007FF7715B190D  mul         rax,r8  
00007FF7715B1910  mov         rcx,r8  
00007FF7715B1913  sub         rcx,rdx  
00007FF7715B1916  shr         rcx,1  
00007FF7715B1919  add         rcx,rdx  
00007FF7715B191C  shr         rcx,1Dh  
00007FF7715B1920  imul        rax,rcx,3B9ACA00h  
00007FF7715B1927  sub         r8,rax  
00007FF7715B192A  mov         dword ptr [r9],ecx  
00007FF7715B192D  lea         r9,[r9-4]  
00007FF7715B1931  lea         rax,[r9+4]  
00007FF7715B1935  cmp         rax,r14  
00007FF7715B1938  jne         NumToString+0D0h (07FF7715B1900h)

Notice the MUL instruction 5 lines down. This generated code is extremely unintuitive, I know, in fact it looks nothing like the compiled code but DIV is extremely slow ~25 cycles for a 32 bit div, and ~75 according to this chart on modern PCs compared with MUL or IMUL (around 3 or 4 cycles) and so it makes sense to try to get rid of DIV even if you have to add all sorts of extra instructions.

I don't fully understand the optimization here, but if you would like to see a rational and a mathematical explanation of using compile time and multiplication to divide constants, see this paper.

This is an example of is the compiler making use of the performance and capability of the full 64 by 64 bit untruncated multiply without showing the c++ coder any sign of it.

回答6:

As already explained C/C++ does not do word*word to double-word operations which is what the mul instruction is best for. But there are cases where you want word*word to double-word so you need an extension to C/C++.

GCC, Clang, and ICC provide provide a builtin type __int128 which you can use to indirectly get the mul instruction.

With MSVC it provides the _umul128 intrinsic (since at least VS 2010) which generates the mul instruction. With this intrinsic along with the _addcarry_u64 intrinsic you could build your own efficient __int128 type with MSVC.