Hashing a string for use in hash table (Double Has

2019-08-29 02:30发布

问题:

I am trying to use Double Hashing to hash a String key into a hash table. I did something like:

protected int getIndex(String key) {
  int itr = 0,
      size = this.values.length,
      index1,
      index2,
      index = 0;

  do {
    // do double hashing to get index for curr [itr] (iteration)
    index1 = Math.abs(key.hashCode()) % size;
    index2 = size - ((key + key + "#!@").hashCode() % size); # trying very hard to eliminate clash, but still fails ... TA and AT gets index 2 when size = 5
    index = (index1 + (itr * index2)) % size;

    // if itr > set threshold, exit
    itr++;
    if (itr > 200) {
      index = -1;
      break;
    }

    // once index found, exit loop
  } while (index > 0 && this.keys[index] != null && !this.keys[index].equals(key));

  return index;
}

Main part is the 1st 3 lines after the do. Can I say if I use Double Hashing, it should eliminate the probability of collision? size is total possible values of unique keys for my hash table

回答1:

So I see two things going on here

  1. Using two different hashes and combining them in attempt to get a more distributed hash
  2. If a hash fails, trying a new spot a little farther along

At first blush, it looks like both of these are a good way to reduce hash collisions. However, on closer inspection, both of these fall into real algorithmic trouble.

Combining two hashes
Hashing algorithms are designed to be fairly well distributed across the integer spectrum. Just like how adding two random numbers together doesn't give you anything more randomer, adding two hashes together doesn't get you something more distributed. In fact, adding two idential distributions together will ALWAYS give you something less evenly distributed. Thus any kind of double hashing strategy, using the same underlying algorithm, is worse than a single hashing strategy.

Trying a new spot
It's tempting to try an algorithm that tries a new hash if the first one collides. However, this causes problems with the retrieval part of the algorithm. When you put something in the hash, and it bumps to the another spot. Then when you go to retrieve the value, it's not there. Worse yet, whether or not you even find it depends on if the first element is still there or not. If it's been removed, then it's impossible to tell if item you're looking for is further along, or if it just isn't there. Ultimately a .contains test would have to go through all 200 iterations before it could be sure that the hash it's looking for isn't there.

The best solution is to use the out of the box hash provided by Java. If you're getting a lot of collisions, the best thing to use a lower load factor in the hash. This increases the number of buckets, and causes collisions to be less likely.