Why in the world was _mm_crc32_u64(...)
defined like this?
unsigned __int64 _mm_crc32_u64( unsigned __int64 crc, unsigned __int64 v );
The "crc32" instruction always accumulates a 32-bit CRC, never a 64-bit CRC (It is, after all, CRC32 not CRC64). If the machine instruction CRC32 happens to have a 64-bit destination operand, the upper 32 bits are ignored, and filled with 0's on completion, so there is NO use to EVER have a 64-bit destination. I understand why Intel allowed a 64-bit destination operand on the instruction (for uniformity), but if I want to process data quickly, I want a source operand as large as possible (i.e. 64-bits if I have that much data left, smaller for the tail ends) and always a 32-bit destination operand. But the intrinsics don't allow a 64-bit source and 32-bit destination. Note the other intrinsics:
unsigned int _mm_crc32_u8 ( unsigned int crc, unsigned char v );
The type of "crc" is not an 8-bit type, nor is the return type, they are 32-bits. Why is there no
unsigned int _mm_crc32_u64 ( unsigned int crc, unsigned __int64 v );
? The Intel instruction supports this, and that is the intrinsic that makes the most sense.
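The obvious workaround is a thin wrapper over the existing intrinsic that just truncates the result. A sketch (the name crc32_u64_32 is mine, and I'm using stdint types for portability):

#include <stdint.h>
#include <nmmintrin.h>   // SSE4.2 CRC32 intrinsics; GCC needs -msse4.2

// The instruction zeroes the upper 32 bits of the destination anyway,
// so truncating the 64-bit result back to 32 bits loses nothing.
static inline uint32_t crc32_u64_32(uint32_t crc, uint64_t v)
{
    return (uint32_t)_mm_crc32_u64(crc, v);
}

But that still routes the CRC through the 64-bit intrinsic, and the compiler may or may not strip out the widening and truncation (see the disassembly note further down).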
Does anyone have portable code (Visual Studio and GCC) to implement the latter intrinsic? Thanks. My guess is something like this:
#define CRC32(D32,S) __asm__("crc32 %0, %1" : "+xrm" (D32) : ">xrm" (S))
for GCC, and
#define CRC32(D32,S) __asm { crc32 D32, S }
for VisualStudio. Unfortunately I have little understanding of how constraints work, and little experience with the syntax and semantics of assembly level programming.
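My best reading of the GCC documentation so far suggests something more like the sketch below. It assumes x86-64, AT&T operand order (source first, destination last), the crc32q form of the instruction, and that "+r" / "rm" are the right constraints; any of those assumptions may be wrong:

#include <stdint.h>

static inline uint32_t crc32_u64_asm(uint32_t crc, uint64_t v)
{
    // The 64-bit form of the instruction needs a 64-bit destination register,
    // so widen the CRC into a temporary and truncate afterwards (the upper
    // 32 bits come back as zero anyway).
    uint64_t tmp = crc;
    __asm__("crc32q %1, %0" : "+r" (tmp) : "rm" (v));
    return (uint32_t)tmp;
}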
Small edit: note the macros I've defined:
#define GET_INT64(P) *(reinterpret_cast<const uint64* &>(P))++
#define GET_INT32(P) *(reinterpret_cast<const uint32* &>(P))++
#define GET_INT16(P) *(reinterpret_cast<const uint16* &>(P))++
#define GET_INT8(P) *(reinterpret_cast<const uint8 * &>(P))++
#define DO1_HW(CR,P) CR = _mm_crc32_u8 (CR, GET_INT8 (P))
#define DO2_HW(CR,P) CR = _mm_crc32_u16(CR, GET_INT16(P))
#define DO4_HW(CR,P) CR = _mm_crc32_u32(CR, GET_INT32(P))
#define DO8_HW(CR,P) CR = (_mm_crc32_u64((uint64)CR, GET_INT64(P))) & 0xFFFFFFFF
Notice how different the last macro statement is. The lack of uniformity is certainly an indication that the intrinsic has not been defined sensibly. While it is not necessary to put in the explicit (uint64) cast in the last macro, it is implicit and does happen. Disassembling the generated code shows code for both casts, 32->64 and 64->32, both of which are unnecessary.
Put another way, it's _mm_crc32_u64, not _mm_crc64_u64, but they've implemented it as if it were the latter.
If I could get the definition of CRC32 above correct, then I would want to change my macros to
#define DO1_HW(CR,P) CR = CRC32(CR, GET_INT8 (P))
#define DO2_HW(CR,P) CR = CRC32(CR, GET_INT16(P))
#define DO4_HW(CR,P) CR = CRC32(CR, GET_INT32(P))
#define DO8_HW(CR,P) CR = CRC32(CR, GET_INT64(P))
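For context, the loop these macros are meant to feed looks roughly like the sketch below (crc32_hw is just an illustrative name; it assumes the same uint8/uint32/uint64 typedefs as the macros above, and the tail handling is simplified):

uint32 crc32_hw(const void* data, size_t len, uint32 crc)
{
    const uint8* p = static_cast<const uint8*>(data);
    while (len >= 8) { DO8_HW(crc, p); len -= 8; }   // bulk of the buffer, 8 bytes at a time
    if (len >= 4)    { DO4_HW(crc, p); len -= 4; }   // then the 4-, 2-, and 1-byte tails
    if (len >= 2)    { DO2_HW(crc, p); len -= 2; }
    if (len >= 1)    { DO1_HW(crc, p); }
    return crc;
}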