I have two 128 bit numbers in memory in hexadecimal, for example (little endian):
x:0x12 0x45 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
y:0x36 0xa1 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
I've to perform the unsigned multiplication between these two numbers so my new number will be:
z:0xcc 0xe3 0x7e 0x2b 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
Now, I'm aware that I can move the half x and y number into rax
and rbx
registers and, for example, do the mul
operation, and do the same with the other half. The problem is that by doing so I lose the carry-over and I've no idea how I can avoid that. It's about 4 hours I'm facing this problem and the only solution that can I see is the conversion in binary (and
<-> shl,1
).
Can you give me some input about this problem?
I think the best solution is to take one byte par time.
Let μ = 264, then we can decompose your 128 bit numbers a and b into a = a1μ + a2 and b = b1μ + b2. Then we can compute c = ab with 64 · 64 → 128 bit multiplications by first computing partial products:
and then accumulating them into a 256 bit result (watch the overflow when doing the additions!):
As usual, ask a compiler how to do something efficiently: GNU C on 64-bit platforms supports
__int128_t
and__uint128_t
.compiles to (gcc6.2
-O3
on Godbolt)Since this is targeting the x86-64 System V calling convention,
a
is in RSI:RDI, whileb
is in RCX:RDX. The result is returned in RDX:RAX.Pretty nifty that it only takes one MOV instruction, since gcc doesn't need the high-half result of a_upper * b_lower or vice versa. It can destroy the high halves of the inputs with the faster 2-operand form of IMUL since they're only used once.
With
-march=haswell
to enable BMI2, gcc uses MULX to avoid even the one MOV.Sometimes compiler output isn't perfect, but very often the general strategy is a good starting point for optimizing by hand.
Of course, if what you really wanted in the first place was 128-bit multiplies in C, just use the compiler's built-in support for it. That lets the optimizer do its job, often giving better results than if you'd written a couple parts in inline-asm. (https://gcc.gnu.org/wiki/DontUseInlineAsm).