Gprof tells me that my computationally heavy program spends the largest share of its time (36%) hashing using AP-Hash.
I can't reduce the call count, but I would still like to make it faster. Can I call the SHA intrinsics from a C program?
Do I need the Intel compiler, or can I stick with GCC?
Unless you work at Intel, you can't yet. SHA extensions have not yet been included on any released CPU; they are expected to be included in Intel's Skylake microarchitecture (which isn't expected until 2015 or 2016).
Moreover, the AP hash function is probably already faster than even an accelerated SHA would be. You may want to consider alternative approaches, such as optimizing the hash function or caching the results for hot values.
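As a rough illustration of the caching idea, here is a minimal sketch that stores the hash next to the key so each hot value is hashed only once. Everything in it is an assumption for illustration: the hashed_key type, key_init, key_hash, and the slow_hash_calls counter are hypothetical names, and slow_hash is a 32-bit FNV-1a stand-in, not the AP hash from the question.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical key type: the hash is stored next to the data it covers. */
typedef struct {
    const char *data;
    size_t      len;
    uint32_t    hash;        /* meaningful only when hash_valid != 0 */
    int         hash_valid;
} hashed_key;

/* Counts how often the expensive hash really runs (illustration only). */
static unsigned long slow_hash_calls = 0;

/* Stand-in for the real hash: 32-bit FNV-1a, NOT the AP hash. */
static uint32_t slow_hash(const char *data, size_t len)
{
    uint32_t h = 2166136261u;            /* FNV offset basis */
    slow_hash_calls++;
    for (size_t i = 0; i < len; i++) {
        h ^= (unsigned char)data[i];
        h *= 16777619u;                  /* FNV prime */
    }
    return h;
}

/* Bind a key to its data; the hash is computed lazily on first use. */
static void key_init(hashed_key *k, const char *data, size_t len)
{
    k->data = data;
    k->len = len;
    k->hash = 0;
    k->hash_valid = 0;
}

/* Return the cached hash, computing it only on the first call. */
static uint32_t key_hash(hashed_key *k)
{
    if (!k->hash_valid) {
        k->hash = slow_hash(k->data, k->len);
        k->hash_valid = 1;
    }
    return k->hash;
}
```

This only helps if the same key objects are hashed repeatedly and their contents do not change after key_init. If the data can change, the cached value has to be invalidated by clearing hash_valid.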
SHA instructions are now available in the Goldmont microarchitecture, which was released around September 2016. According to the Intel Intrinsics Guide, these are the intrinsics of interest:
__m128i _mm_sha1msg1_epu32 (__m128i a, __m128i b)
__m128i _mm_sha1msg2_epu32 (__m128i a, __m128i b)
__m128i _mm_sha1nexte_epu32 (__m128i a, __m128i b)
__m128i _mm_sha1rnds4_epu32 (__m128i a, __m128i b, const int func)
__m128i _mm_sha256msg1_epu32 (__m128i a, __m128i b)
__m128i _mm_sha256msg2_epu32 (__m128i a, __m128i b)
__m128i _mm_sha256rnds2_epu32 (__m128i a, __m128i b, __m128i k)
GCC 5.0 and above make the intrinsics available all the time through function-specific option pragmas. You will need Binutils 2.24, however. Testing shows Clang 3.7 and 3.8 also support the intrinsics. Visual Studio 2015 can consume them as well, but VS2013 failed to compile them.
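To make the function-specific option idea concrete, here is a small sketch for GCC or Clang on x86 that enables SHA for one function only and guards the call with a runtime CPUID check, so the file builds without -msha. The names cpu_has_sha, sha1msg1_block, try_sha1msg1 and the HAVE_SHA_TARGET macro are made up for this example. It only exercises _mm_sha1msg1_epu32 on one pair of vectors and is not a SHA1 implementation.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#include <immintrin.h>
#define HAVE_SHA_TARGET 1

/* Runtime check: CPUID leaf 7, sub-leaf 0, EBX bit 29 reports SHA support. */
static int cpu_has_sha(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (__get_cpuid_max(0, NULL) < 7)
        return 0;
    __cpuid_count(7, 0, eax, ebx, ecx, edx);
    return (int)((ebx >> 29) & 1u);
}

/* Only this function is compiled with the SHA extensions enabled. */
__attribute__((target("sha")))
static void sha1msg1_block(uint32_t out[4], const uint32_t a[4],
                           const uint32_t b[4])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_sha1msg1_epu32(va, vb));
}
#else
#define HAVE_SHA_TARGET 0
#endif

/* Returns 0 if the SHA instruction ran, -1 if it is unavailable here. */
int try_sha1msg1(uint32_t out[4], const uint32_t a[4], const uint32_t b[4])
{
#if HAVE_SHA_TARGET
    if (cpu_has_sha()) {
        sha1msg1_block(out, a, b);
        return 0;
    }
#endif
    (void)out; (void)a; (void)b;
    return -1;
}
```

The runtime check matters because the target attribute only tells the compiler it may emit SHA instructions. It does not guarantee the machine running the binary can execute them.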
You can detect the availability of SHA in the preprocessor on Linux by looking for the macro __SHA__. -march=native will make it available if it's native to the processor. If not, you can enable it with -msha.
The SHA1 code is based on Intel's blog titled Intel® SHA Extensions. Another reference implementation is available from the miTLS project.
The code works with full SHA1 blocks, so const uint32_t *data is 64 bytes. You will have to add the padding for the final block and set the bit length.
It runs at about 1.7 cycles-per-byte (cpb) on a Celeron J3455. I believe Andy Polyakov has SHA1 running around 1.5 cpb for OpenSSL. For reference, an optimized C/C++ implementation will run somewhere around 9 to 10 cpb.
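The padding step mentioned above can be sketched roughly as follows. This follows the standard SHA-1 padding rule (a 0x80 byte, zero fill, then the message length in bits as a 64-bit big-endian value). The function name sha1_pad_tail and its interface are assumptions for this example and are not taken from Intel's code. It produces bytes, so a block function that expects const uint32_t *data would still need its usual load and byte-swap handling.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Build the final padded block(s) of a SHA-1 message.
 *   tail      : the last (total_len % 64) bytes of the message
 *   total_len : length of the whole message in bytes
 *   out       : receives one or two 64-byte blocks
 * Returns the number of bytes written to out (64 or 128).
 */
size_t sha1_pad_tail(const uint8_t *tail, uint64_t total_len, uint8_t out[128])
{
    size_t tail_len = (size_t)(total_len % 64);
    /* 0x80 plus the 8-byte length need 9 bytes after the tail. */
    size_t padded_len = (tail_len < 56) ? 64 : 128;
    uint64_t bit_len = total_len * 8;

    memset(out, 0, 128);
    if (tail_len > 0)
        memcpy(out, tail, tail_len);
    out[tail_len] = 0x80;                 /* mandatory 1 bit, then zeros */

    /* Message length in bits, big-endian, in the last 8 bytes. */
    for (int i = 0; i < 8; i++)
        out[padded_len - 1 - i] = (uint8_t)(bit_len >> (8 * i));

    return padded_len;
}
```

The caller would run the block function over all full 64-byte blocks of the message first, then over the one or two blocks written to out.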
You can tell if your processor supports the SHA extensions under Linux by looking for the sha_ni flag in /proc/cpuinfo.
Also see Are there in x86 any instructions to accelerate SHA (SHA1/2/256/512) encoding?
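If you want to do the sha_ni check from inside a C program instead of the shell, something along these lines should work on Linux. It is only a sketch: flags_line_has and linux_cpuinfo_has_sha_ni are hypothetical helper names, and it assumes the usual /proc/cpuinfo layout where a line starting with "flags" lists space-separated feature names.

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 if token appears as a whole word in line, else 0. */
static int flags_line_has(const char *line, const char *token)
{
    size_t n = strlen(token);
    const char *p = line;
    while ((p = strstr(p, token)) != NULL) {
        int start_ok = (p == line) || p[-1] == ' ' || p[-1] == '\t' ||
                       p[-1] == ':';
        int end_ok = p[n] == '\0' || p[n] == ' ' || p[n] == '\n';
        if (start_ok && end_ok)
            return 1;
        p += n;                           /* keep searching after this hit */
    }
    return 0;
}

/* Returns 1 if sha_ni is listed, 0 if not, -1 if the file is unreadable. */
int linux_cpuinfo_has_sha_ni(void)
{
    char line[8192];
    int found = 0;
    FILE *f = fopen("/proc/cpuinfo", "r");
    if (f == NULL)
        return -1;
    while (!found && fgets(line, sizeof line, f) != NULL) {
        if (strncmp(line, "flags", 5) == 0)
            found = flags_line_has(line, "sha_ni");
    }
    fclose(f);
    return found;
}
```

The whole-word match is there so that a longer flag name that merely contains sha_ni is not counted as a hit.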
You can find source for both Intel SHA intrinsics and ARMv8 SHA intrinsics in the SHA-Intrinsics repository on Noloader's GitHub. They are C source files, and provide the compress function for SHA-1, SHA-224 and SHA-256. The intrinsic-based implementations increase throughput approximately 3x to 4x for SHA-1, and approximately 6x to 12x for SHA-224 and SHA-256.