可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

How to multiply two 64-bit integers by another 2 64-bit integers? I didn't find any instruction which can do it.

回答1:

I know this is an old question but I was actually looking for exactly this. As there's still no instruction for it I implemented the 64 bit multiply myself with the pmuldq as Paul R mentioned. This is what I came up with:

// requires g++ -msse4.1 ...

#include <emmintrin.h>
#include <smmintrin.h>

__m128i Multiply64Bit(__m128i a, __m128i b)
{
    auto ax0_ax1_ay0_ay1 = a;
    auto bx0_bx1_by0_by1 = b;

    // i means ignored

    auto ax1_i_ay1_i = _mm_shuffle_epi32(ax0_ax1_ay0_ay1, _MM_SHUFFLE(3, 3, 1, 1));
    auto bx1_i_by1_i = _mm_shuffle_epi32(bx0_bx1_by0_by1, _MM_SHUFFLE(3, 3, 1, 1));

    auto ax0bx0_ay0by0 = _mm_mul_epi32(ax0_ax1_ay0_ay1, bx0_bx1_by0_by1);
    auto ax0bx1_ay0by1 = _mm_mul_epi32(ax0_ax1_ay0_ay1, bx1_i_by1_i);
    auto ax1bx0_ay1by0 = _mm_mul_epi32(ax1_i_ay1_i, bx0_bx1_by0_by1);

    auto ax0bx1_ay0by1_32 = _mm_slli_epi64(ax0bx1_ay0by1, 32);
    auto ax1bx0_ay1by0_32 = _mm_slli_epi64(ax1bx0_ay1by0, 32);

    return _mm_add_epi64(ax0bx0_ay0by0, _mm_add_epi64(ax0bx1_ay0by1_32, ax1bx0_ay1by0_32));
}

Godbolt at SSE Multiply64Bit.

回答2:

You would need to implement your own 64 bit multiplication routine using 32 bit multiply operations. It's probably not going to be any more efficient than just doing this with scalar code though, particularly as there will be a lot of shuffling of the vectors to get all the required operations.

回答3:

Late answer, but this is a better version of what Barabas posted.

If you have ever used GCC or Clang's vector extensions, this is the routine they use.

This uses the same method that long multiplication and grid multiplication use.

    65
  * 73
  ----
    15 //   (5 * 3)
   180 //   (6 * 3) * 10
   350 //   (5 * 7) * 10
+ 4200 // + (6 * 7) * 100
------
  4745

However, instead of doing each unit of 10, it uses each unit of 32 bits, and it leaves out the last multiply because it will always be shifted past the 64th bit, just like how you wouldn't multiply 6*7 if you were truncating values greater than 99.

#include <emmintrin.h>

/*
 * Grid/long multiply two 64-bit SSE lanes.
 * Works for both signed and unsigned.
 *   ----------------.--------------.----------------.
 *  |                |   b >> 32    | a & 0xFFFFFFFF |
 *  |----------------|--------------|----------------|  
 *  | d >> 32        |   b*d << 64  |    a*d << 32   |
 *  |----------------|--------------|----------------|
 *  | c & 0xFFFFFFFF |   b*c << 32  |       a*c      |
 *  '----------------'--------------'----------------'
 *  Add all of them together to get the product.
 *
 *  Because we truncate the value to 64 bits, b*d << 64 will be zero,
 *  so we can leave it out.
 *
 *  We also can add a*d and b*c first and then shift because of the
 *  distributive property: (a << 32) + (b << 32) == (a + b) << 32.
 */

__m128i Multiply64Bit(__m128i ab, __m128i cd)
{
    /* ac = (ab & 0xFFFFFFFF) * (cd & 0xFFFFFFFF); */
    __m128i ac = _mm_mul_epu32(ab, cd);

    /* b = ab >> 32; */
    __m128i b = _mm_srli_epi64(ab, 32);

    /* bc = b * (cd & 0xFFFFFFFF); */
    __m128i bc = _mm_mul_epu32(b, cd);

    /* d = cd >> 32; */
    __m128i d = _mm_srli_epi64(cd, 32);

    /* ad = (ab & 0xFFFFFFFF) * d; */
    __m128i ad = _mm_mul_epu32(ab, d);

    /* high = bc + ad; */
    __m128i high = _mm_add_epi64(bc, ad);

    /* high <<= 32; */
    high = _mm_slli_epi64(high, 32);

    /* return ac + high; */
    return _mm_add_epi64(high, ac);
}

Compiler Explorer Note: The GCC vector extension version was also included below for comparison.