Why in the world was _mm_crc32_u64(...)
defined like this?
unsigned __int64 _mm_crc32_u64( unsigned __int64 crc, unsigned __int64 v );
The "crc32" instruction always accumulates a 32-bit CRC, never a 64-bit CRC (It is, after all, CRC32 not CRC64). If the machine instruction CRC32 happens to have a 64-bit destination operand, the upper 32 bits are ignored, and filled with 0's on completion, so there is NO use to EVER have a 64-bit destination. I understand why Intel allowed a 64-bit destination operand on the instruction (for uniformity), but if I want to process data quickly, I want a source operand as large as possible (i.e. 64-bits if I have that much data left, smaller for the tail ends) and always a 32-bit destination operand. But the intrinsics don't allow a 64-bit source and 32-bit destination. Note the other intrinsics:
unsigned int _mm_crc32_u8 ( unsigned int crc, unsigned char v );
The type of "crc" is not an 8-bit type, nor is the return type, they are 32-bits. Why is there no
unsigned int _mm_crc32_u64 ( unsigned int crc, unsigned __int64 v );
? The Intel instruction supports this, and that is the intrinsic that makes the most sense.
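The obvious workaround is a thin wrapper over the existing intrinsic that just truncates the result. A sketch (the name crc32_u64_32 is mine, and I'm using stdint types for portability):

#include <stdint.h>
#include <nmmintrin.h>   // SSE4.2 CRC32 intrinsics; GCC needs -msse4.2

// The instruction zeroes the upper 32 bits of the destination anyway,
// so truncating the 64-bit result back to 32 bits loses nothing.
static inline uint32_t crc32_u64_32(uint32_t crc, uint64_t v)
{
    return (uint32_t)_mm_crc32_u64(crc, v);
}

But that still routes the CRC through the 64-bit intrinsic, and the compiler may or may not strip out the widening and truncation (see the disassembly note further down).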
Does anyone have portable code (Visual Studio and GCC) to implement the latter intrinsic? Thanks. My guess is something like this:
#define CRC32(D32,S) __asm__("crc32 %0, %1" : "+xrm" (D32) : ">xrm" (S))
for GCC, and
#define CRC32(D32,S) __asm { crc32 D32, S }
for VisualStudio. Unfortunately I have little understanding of how constraints work, and little experience with the syntax and semantics of assembly level programming.
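My best reading of the GCC documentation so far suggests something more like the sketch below. It assumes x86-64, AT&T operand order (source first, destination last), the crc32q form of the instruction, and that "+r" / "rm" are the right constraints; any of those assumptions may be wrong:

#include <stdint.h>

static inline uint32_t crc32_u64_asm(uint32_t crc, uint64_t v)
{
    // The 64-bit form of the instruction needs a 64-bit destination register,
    // so widen the CRC into a temporary and truncate afterwards (the upper
    // 32 bits come back as zero anyway).
    uint64_t tmp = crc;
    __asm__("crc32q %1, %0" : "+r" (tmp) : "rm" (v));
    return (uint32_t)tmp;
}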
Small edit: note the macros I've defined:
#define GET_INT64(P) *(reinterpret_cast<const uint64* &>(P))++
#define GET_INT32(P) *(reinterpret_cast<const uint32* &>(P))++
#define GET_INT16(P) *(reinterpret_cast<const uint16* &>(P))++
#define GET_INT8(P) *(reinterpret_cast<const uint8 * &>(P))++
#define DO1_HW(CR,P) CR = _mm_crc32_u8 (CR, GET_INT8 (P))
#define DO2_HW(CR,P) CR = _mm_crc32_u16(CR, GET_INT16(P))
#define DO4_HW(CR,P) CR = _mm_crc32_u32(CR, GET_INT32(P))
#define DO8_HW(CR,P) CR = (_mm_crc32_u64((uint64)CR, GET_INT64(P))) & 0xFFFFFFFF
Notice how different the last macro statement is. The lack of uniformity is certainly an indication that the intrinsic has not been defined sensibly. While it is not necessary to put in the explicit (uint64) cast in the last macro, it is implicit and does happen. Disassembling the generated code shows code for both casts, 32->64 and 64->32, both of which are unnecessary.
Put another way, it's _mm_crc32_u64, not _mm_crc64_u64, but they've implemented it as if it were the latter.
If I could get the definition of CRC32 above correct, then I would want to change my macros to
#define DO1_HW(CR,P) CR = CRC32(CR, GET_INT8 (P))
#define DO2_HW(CR,P) CR = CRC32(CR, GET_INT16(P))
#define DO4_HW(CR,P) CR = CRC32(CR, GET_INT32(P))
#define DO8_HW(CR,P) CR = CRC32(CR, GET_INT64(P))
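For context, the loop these macros are meant to feed looks roughly like the sketch below (crc32_hw is just an illustrative name; it assumes the same uint8/uint32/uint64 typedefs as the macros above, and the tail handling is simplified):

uint32 crc32_hw(const void* data, size_t len, uint32 crc)
{
    const uint8* p = static_cast<const uint8*>(data);
    while (len >= 8) { DO8_HW(crc, p); len -= 8; }   // bulk of the buffer, 8 bytes at a time
    if (len >= 4)    { DO4_HW(crc, p); len -= 4; }   // then the 4-, 2-, and 1-byte tails
    if (len >= 2)    { DO2_HW(crc, p); len -= 2; }
    if (len >= 1)    { DO1_HW(crc, p); }
    return crc;
}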