Porting MeiYan hash function to Go

Published 2019-05-19 10:18

Question:

I wanted to port the state-of-the-art hash function MeiYan from C to Go. (As far as I know it is one of the best, if not the best, hash functions for hash tables in terms of speed and collision rate; it beats MurMur at least.)

I am new to Go, just spent one weekend with it, and came up with this version:

func meiyan(key *byte, count int) uint32 {
    var h uint32 = 0x811c9dc5
    for count >= 8 {
        a := ((*(*uint32)(unsafe.Pointer(key))) << 5)
        b := ((*(*uint32)(unsafe.Pointer(key))) >> 27)
        c := *(*uint32)(unsafe.Pointer(uintptr(unsafe.Pointer(key)) + 4))
        h = (h ^ ((a | b) ^ c)) * 0xad3e7
        count -= 8
        key = (*byte)(unsafe.Pointer(uintptr(unsafe.Pointer(key)) + 8))
    }
    if (count & 4) != 0 {
        h = (h ^ uint32(*(*uint16)(unsafe.Pointer(key)))) * 0xad3e7
        key = (*byte)(unsafe.Pointer(uintptr(unsafe.Pointer(key)) + 2))
        h = (h ^ uint32(*(*uint16)(unsafe.Pointer(key)))) * 0xad3e7
        key = (*byte)(unsafe.Pointer(uintptr(unsafe.Pointer(key)) + 2))
    }
    if (count & 2) != 0 {
        h = (h ^ uint32(*(*uint16)(unsafe.Pointer(key)))) * 0xad3e7
        key = (*byte)(unsafe.Pointer(uintptr(unsafe.Pointer(key)) + 2))
    }
    if (count & 1) != 0 {
        h = h ^ uint32(*key)
        h = h * 0xad3e7
    }
    return h ^ (h >> 16)
}

It looks messy, but I do not think I can make it look better. When I measure its speed it is frustratingly slow: about 3 times slower than the C/C++ version compiled with gccgo -O3. Can this be made faster? Is this as good as the compiler can do, or is the unsafe.Pointer conversion simply that slow? This surprised me, because I have seen other number-crunching-style Go code run as fast as C or even faster. Am I doing something inefficient here?
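For comparison, the same algorithm can also be written without unsafe by slicing the input and reading words through encoding/binary. This is a sketch of my own (the name meiyanSafe is hypothetical, not from the original code), and it assumes little-endian word reads, which is what the pointer casts above effectively do on x86:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// meiyanSafe is a hypothetical slice-based variant of the same
// algorithm that avoids unsafe.Pointer by reading 32- and 16-bit
// words with encoding/binary (little-endian, matching x86).
func meiyanSafe(key []byte) uint32 {
	var h uint32 = 0x811c9dc5
	for len(key) >= 8 {
		k := binary.LittleEndian.Uint32(key)
		h = (h ^ (((k << 5) | (k >> 27)) ^ binary.LittleEndian.Uint32(key[4:]))) * 0xad3e7
		key = key[8:]
	}
	// Tail: since key shrinks as we consume it, testing len(key)
	// here is equivalent to testing the remaining count in the C code.
	if len(key)&4 != 0 {
		h = (h ^ uint32(binary.LittleEndian.Uint16(key))) * 0xad3e7
		key = key[2:]
		h = (h ^ uint32(binary.LittleEndian.Uint16(key))) * 0xad3e7
		key = key[2:]
	}
	if len(key)&2 != 0 {
		h = (h ^ uint32(binary.LittleEndian.Uint16(key))) * 0xad3e7
		key = key[2:]
	}
	if len(key)&1 != 0 {
		h = (h ^ uint32(key[0])) * 0xad3e7
	}
	return h ^ (h >> 16)
}

func main() {
	fmt.Printf("%08x\n", meiyanSafe([]byte("Hello World!")))
}
```

Whether this matches the unsafe version's speed depends on how well the compiler elides bounds checks, so it would need benchmarking rather than being assumed faster or slower.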

Here is the original C code I am porting from:

u32 meiyan(const char *key, int count) {
    typedef u32* P;
    u32 h = 0x811c9dc5;
    while (count >= 8) {
        h = (h ^ ((((*(P)key) << 5) | ((*(P)key) >> 27)) ^ *(P)(key + 4))) * 0xad3e7;
        count -= 8;
        key += 8;
    }
    #define tmp h = (h ^ *(u16*)key) * 0xad3e7; key += 2;
    if (count & 4) { tmp tmp }
    if (count & 2) { tmp }
    if (count & 1) { h = (h ^ *key) * 0xad3e7; }
    #undef tmp
    return h ^ (h >> 16);
}

Here is how I measure speed:

func main(){
    T := time.Now().UnixNano()/1e6
    buf := []byte("Hello World!")
    var controlSum uint64 = 0
    for x := 123; x < 1e8; x++ {
        controlSum += uint64(meiyan(&buf[0], 12))
    }
    fmt.Println(time.Now().UnixNano()/1e6 - T, "ms")
    fmt.Println("controlSum:", controlSum)
}
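Manual wall-clock timing like the above is easy to get subtly wrong; the standard testing package can drive the same measurement. This is a sketch using testing.Benchmark, with meiyanStub as a hypothetical stand-in hash so the example is self-contained (substitute the real meiyan to benchmark it):

```go
package main

import (
	"fmt"
	"testing"
)

// meiyanStub is a hypothetical stand-in with the same signature
// as meiyan, used only so this example compiles on its own.
func meiyanStub(key *byte, count int) uint32 {
	return (0x811c9dc5 ^ uint32(*key) ^ uint32(count)) * 0xad3e7
}

var sink uint32 // keeps the compiler from optimizing the call away

func main() {
	buf := []byte("Hello World!")
	start := &buf[0] // hoisted once, outside the timed loop
	res := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sink = meiyanStub(start, len(buf))
		}
	})
	fmt.Println(res) // iterations and ns/op, machine-dependent
}
```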

Answer 1:

After some careful research I found out why my code was slow, and improved it so it is now faster than the C version in my tests:

package main

import (
    "fmt"
    "time"
    "unsafe"
)

func meiyan(key *byte, count int) uint32 {
    type un unsafe.Pointer
    type p32 *uint32
    type p16 *uint16
    type p8 *byte
    var h uint32 = 0x811c9dc5
    for count >= 8 {
        a := *p32(un(key)) << 5
        b := *p32(un(key)) >> 27
        c := *p32(un(uintptr(un(key)) + 4))
        h = (h ^ ((a | b) ^ c)) * 0xad3e7
        count -= 8
        key = p8(un(uintptr(un(key)) + 8))
    }
    if (count & 4) != 0 {
        h = (h ^ uint32(*p16(un(key)))) * 0xad3e7
        key = p8(un(uintptr(un(key)) + 2))
        h = (h ^ uint32(*p16(un(key)))) * 0xad3e7
        key = p8(un(uintptr(un(key)) + 2))
    }
    if (count & 2) != 0 {
        h = (h ^ uint32(*p16(un(key)))) * 0xad3e7
        key = p8(un(uintptr(un(key)) + 2))
    }
    if (count & 1) != 0 {
        h = h ^ uint32(*key)
        h = h * 0xad3e7
    }
    return h ^ (h >> 16)
}

func main() {
    T := time.Now().UnixNano()/1e6
    buf := []byte("ABCDEFGHABCDEFGH")
    var controlSum uint64 = 0
    start := &buf[0]
    size := len(buf)
    for x := 123; x < 1e8; x++ {
        controlSum += uint64(meiyan(start, size))
    }
    fmt.Println(time.Now().UnixNano()/1e6 - T, "ms")
    fmt.Println("controlSum:", controlSum)
}

The hash function itself was already fast; what made the benchmark slow was evaluating &buf[0] (and its bounds check) on every loop iteration. The fix was to hoist it out of the loop as start := &buf[0] and pass start on each iteration.
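The hoisting pattern described above can be shown in isolation. This minimal sketch uses fnvStep, a hypothetical one-byte stand-in hash, so it compiles without the unsafe code; both loops compute the same sum, only the per-iteration cost of obtaining the pointer differs:

```go
package main

import "fmt"

// fnvStep is a hypothetical stand-in for meiyan: it hashes a
// single byte so the example stays self-contained.
func fnvStep(p *byte) uint32 {
	return (0x811c9dc5 ^ uint32(*p)) * 0x01000193
}

func main() {
	buf := []byte("Hello World!")

	// Before: &buf[0] is re-evaluated inside the loop every iteration.
	var slow uint64
	for i := 0; i < 1000; i++ {
		slow += uint64(fnvStep(&buf[0]))
	}

	// After: the pointer is computed once, outside the loop.
	start := &buf[0]
	var fast uint64
	for i := 0; i < 1000; i++ {
		fast += uint64(fnvStep(start))
	}

	fmt.Println(slow == fast) // prints true: same result, cheaper loop
}
```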

Answer 2:

The implementation from NATS looks impressive! On my machine, for data of length 30 bytes, it achieves about 157,175,656 ops/sec (6.36 ns/op). Take a look at it; you might find some ideas.