What are the disadvantages of hashing function usi

2019-02-10 15:54发布

问题:

There are two basic methods for implementing a hash function that are cited in pretty much every text book and CS courses:

  1. Division method where we simply do k mod m essentially picking m as prime not too close to power of 2.
  2. Multiplication method where we multiply k with some well-chosen irrational number (Knuth suggests using number based on Golden ratio) between 0 to 1, take the fractional part of the product and use desired number of most significant bits from it.

Most text books and courses cite several disadvantages for method 1 including the fact that it's expensive and things depends on m. However I've never yet seen any textbook or courses mention single disadvantage for method 2.

This makes method 2 more desirable. Plus the method 2 can be made very efficient on modern computers eliminating floating point arithmetic all together. So it would look like that method 2 is hands down winner and no body should be talking about method 1. But that's obviously not the case. In fact, I've never seen method 2 getting used in any practical implementations. So it does have some disadvantages.

Question is what are they and why method 1 gets used more often despite of its disadvantages?

回答1:

Division method is used in conjunction with hash table algorithms which require prime table size - for example, open addressing with double hashing or QHash, when you anyway need to divide the key, or it's hash, by table size to get the index.

Multiplication method is suitable when the table size is a power of two, then getting the index from the hash could be implemented as bitwise AND operation, therefore the whole path of computing the table index by the key, with multiplication hashing, is very fast. You can explore some actual implementations by searching for magic constant 2654435769 on Github.

There is a recent trend of using MurmurHash3 avalanche procedure instead of multiplication method:

int hash = key;
hash ^= (hash >> 16);
hash *= 0x85ebca6b;
hash ^= (hash >> 13);
hash *= 0xc2b2ae35;
hash ^= (hash >> 16);
// see this code and the version for 64 bits here:
// https://smhasher.googlecode.com/svn/trunk/MurmurHash3.cpp

Because it is just a bit slower, but considered more robust to bad key distribution. That is why you might get wrong (or right?) impression that multiplication method is used unfairly rarely.