I would like my C function to efficiently compute the high 64 bits of the product of two 64 bit signed ints. I know how to do this in x86-64 assembly, with imulq and pulling the result out of %rdx. But I'm at a loss for how to write this in C at all, let alone coax the compiler to do it efficiently.
Does anyone have any suggestions for writing this in C? This is performance sensitive, so "manual methods" (like Russian Peasant, or bignum libraries) are out.
This dorky inline assembly function I wrote works and is roughly the codegen I'm after:
static long mull_hi(long inp1, long inp2) {
    long output = -1;
    __asm__("movq %[inp1], %%rax;"
            "imulq %[inp2];"
            "movq %%rdx, %[output];"
            : [output] "=r" (output)
            : [inp1] "r" (inp1), [inp2] "r" (inp2)
            : "%rax", "%rdx");
    return output;
}
The general answer is that x * y can be broken down into (a + b) * (c + d), where a and c are the high order parts.
First, expand to
ac + ad + bc + bd
Now, you multiply the terms as 32 bit numbers stored as long long (or better yet, uint64_t), and you just remember that when you multiplied a higher order number, you need to scale by 32 bits. Then you do the adds, remembering to detect carry. Keep track of the sign. Naturally, you need to do the adds in pieces.
For code implementing the above, see my other answer.
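The decomposition above can be sketched in portable C roughly as follows. This is a minimal sketch, not the code from the other answer: the names mulhi_u64 and mulhi_s64 are made up for illustration, and instead of tracking the sign separately it computes the unsigned high half and then applies a two's complement correction. It also assumes that converting an out-of-range uint64_t to int64_t wraps, which is implementation-defined but holds on the usual compilers.

```c
#include <stdint.h>

/* High 64 bits of the unsigned 64x64 -> 128 bit product.
   (Illustrative helper name, not from the original answer.) */
static uint64_t mulhi_u64(uint64_t x, uint64_t y) {
    uint64_t a = x >> 32, b = x & 0xFFFFFFFFu;  /* x = a * 2^32 + b */
    uint64_t c = y >> 32, d = y & 0xFFFFFFFFu;  /* y = c * 2^32 + d */

    /* Each partial product is 32 x 32 bits, so it fits in 64 bits. */
    uint64_t ac = a * c;  /* scaled by 2^64 */
    uint64_t ad = a * d;  /* scaled by 2^32 */
    uint64_t bc = b * c;  /* scaled by 2^32 */
    uint64_t bd = b * d;  /* scaled by 1    */

    /* Add everything that lands in bits 32..63 of the full product.
       The sum is below 3 * 2^32, so it cannot overflow; its upper
       half is the carry into bit 64. */
    uint64_t mid = (bd >> 32) + (ad & 0xFFFFFFFFu) + (bc & 0xFFFFFFFFu);

    return ac + (ad >> 32) + (bc >> 32) + (mid >> 32);
}

/* High 64 bits of the signed product. Reinterpreting a negative x as
   unsigned adds 2^64 to it, which adds y to the high half of the
   product (and likewise for y), so subtract those terms back out. */
static int64_t mulhi_s64(int64_t x, int64_t y) {
    uint64_t hi = mulhi_u64((uint64_t)x, (uint64_t)y);
    if (x < 0) hi -= (uint64_t)y;
    if (y < 0) hi -= (uint64_t)x;
    return (int64_t)hi;  /* assumes two's complement wraparound */
}
```

For example, mulhi_s64(-1, 1) gives -1, because the 128 bit product -1 has all of its upper bits set.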
Wait, you have a perfectly good, optimized assembly solution already working for this, and you want to back it out and try to write it in an environment that doesn't support 128 bit math? I'm not following.
As you're obviously aware, this operation is a single instruction on x86-64. Obviously nothing you do is going to make it work any better. If you really want portable C, you'll need to do something like DigitalRoss's code above and hope that your optimizer figures out what you're doing.
If you need architecture portability but are willing to limit yourself to gcc platforms, there are __int128_t (and __uint128_t) types in the compiler intrinsics that will do what you want.
Since you did a pretty good job solving your own problem with the machine code, I figured you deserved some help with the portable version. I would leave an ifdef in so that you still just use the assembly when building with GNU on x86.
Anyway, here is an implementation based on my general answer. I'm pretty sure this is correct, but no guarantees, I just banged this out last night. You probably should get rid of the statics positive_result[] and result_negative - those are just artefacts of my unit test.
If you're using a relatively recent GCC on x86_64:
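A minimal sketch of what that can look like, assuming GCC's __int128 extension on a 64 bit target. The function name mulhi_int128 is made up for illustration, and the code relies on right shift of a negative value being an arithmetic shift, which GCC documents but the C standard leaves implementation-defined.

```c
#include <stdint.h>

/* High 64 bits of the signed 64x64 -> 128 bit product, using the
   __int128 compiler extension. (Illustrative name.) */
static int64_t mulhi_int128(int64_t a, int64_t b) {
    /* Widen one operand first so the multiply is done in 128 bits. */
    __int128 product = (__int128)a * b;
    /* Arithmetic shift keeps the sign; the result fits in 64 bits. */
    return (int64_t)(product >> 64);
}
```

Widening before the multiply matters: writing (__int128)(a * b) would multiply in 64 bits and lose the high half.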
At -O1 and higher, this compiles to what you want: a single imulq with the high half taken from %rdx.
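I believe that clang also has support for the __int128_t type, so this should also work there, with the usual caveats about trying it yourself. As far as I know VC++ does not have that type, but on x64 it provides the __mulh intrinsic, which returns the high 64 bits of a signed 64 bit multiply.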
With regard to your assembly solution, don't hard-code the mov instructions! Let the compiler do it for you. Here's a modified version of your code:
Helpful reference: Machine Constraints
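A minimal sketch of such a modified version, assuming GCC or clang on x86-64 with AT&T syntax. The "a" and "d" machine constraints pin the operands to %rax and %rdx, so the compiler emits any moves it actually needs. The operand name rhs and the __int128 fallback branch are additions for illustration, so that the sketch still compiles on other 64 bit GCC targets.

```c
static long mull_hi(long inp1, long inp2) {
#if defined(__GNUC__) && defined(__x86_64__)
    long output;
    /* One-operand imulq computes %rdx:%rax = %rax * operand.
       "+a" puts inp1 in %rax (read and overwritten with the low half),
       "=d" takes the high half from %rdx, and "rm" lets the compiler
       pick a register or memory location for inp2. */
    __asm__("imulq %[rhs]"
            : "=d" (output), "+a" (inp1)
            : [rhs] "rm" (inp2)
            : "cc");  /* imul modifies the flags */
    return output;
#else
    /* Fallback for other GCC-compatible 64 bit targets (assumption). */
    return (long)(((__int128)inp1 * inp2) >> 64);
#endif
}
```

Because no registers are named in the asm template itself, the compiler is free to skip the moves entirely when the inputs already sit in the right places.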