A faster hash function

2020-06-07 05:08发布

I'm trying to implement my own hash function, i add up the ASCII numbers of each string, using java. I find the hash code by finding the mod of the size of the hash table and the sum. size%sum. I was wondering if there was a way to use the same process but reduce collisions, when searching for the string?

Thanks in advance.

2条回答
一夜七次
2楼-- · 2020-06-07 05:30

The Java String.hashcode() makes a tradeoff between being a really good hash function and being as efficient as possible. Simply adding up the character values in a string is not a reliable hash function.

For example, consider the two strings dog and god. Since they both contain a 'd', 'g', and an 'o', no method involving only addition will ever result in a different hash code.

Joshua Bloch, who implemented a good part of Java, discusses the String.hashCode() method in his book Effective Java and talks about how, in versions of Java prior to 1.3, the String.hashCode() function used to consider only 16 characters in a given String. This ran somewhat faster than the current implementation, but resulted is shockingly poor performance in certain situations.

In general, if your specific data set is very well-defined and you could exploit some uniqueness in it, you could probably make a better hash function. For general purpose Strings, good luck.

查看更多
我想做一个坏孩纸
3楼-- · 2020-06-07 05:31

I would look at the code for String and HashMap as these have a low collision rate and don't use % and handle negative numbers.

From the source for String

public int hashCode() {
    int h = hash;
    if (h == 0 && value.length > 0) {
        char val[] = value;

        for (int i = 0; i < value.length; i++) {
            h = 31 * h + val[i];
        }
        hash = h;
    }
    return h;
}

From the source for HashMap

/**
 * Retrieve object hash code and applies a supplemental hash function to the
 * result hash, which defends against poor quality hash functions.  This is
 * critical because HashMap uses power-of-two length hash tables, that
 * otherwise encounter collisions for hashCodes that do not differ
 * in lower bits. Note: Null keys always map to hash 0, thus index 0.
 */
final int hash(Object k) {
    int h = 0;
    if (useAltHashing) {
        if (k instanceof String) {
            return sun.misc.Hashing.stringHash32((String) k);
        }
        h = hashSeed;
    }

    h ^= k.hashCode();

    // This function ensures that hashCodes that differ only by
    // constant multiples at each bit position have a bounded
    // number of collisions (approximately 8 at default load factor).
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}

As the HashMap is always a power of 2 in size you can use

        hash = (null != key) ? hash(key) : 0;
        bucketIndex = indexFor(hash, table.length);

and

/**
 * Returns index for hash code h.
 */
static int indexFor(int h, int length) {
    return h & (length-1);
}

Using & is much faster than % and only return positive numbers as length is positive.

查看更多
登录 后发表回答