hash function for string


Question:

I'm working on a hash table in C and I'm testing hash functions for strings.

The first function I tried adds up the ASCII codes of the characters and takes the result modulo 100, but I got poor results with my first test data: 40 collisions for 130 words.

The final input data will contain 8,000 words (it's a dictionary stored in a file). The hash table is declared as int table[10000] and contains the position of each word in a txt file.

My first question is: which is the best algorithm for hashing strings? And how do I determine the size of the hash table?

Thanks in advance!

:-)

Answer 1:

I've had nice results with djb2 by Dan Bernstein.

unsigned long
hash(unsigned char *str)
{
    unsigned long hash = 5381;
    int c;

    while (c = *str++)
        hash = ((hash << 5) + hash) + c; /* hash * 33 + c */

    return hash;
}
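If you want to use this with the table from the question, the returned value just needs to be reduced to a slot index. A minimal sketch (slot_for and TABLE_SIZE are illustrative names, not part of the original answer):

#include <stddef.h>

#define TABLE_SIZE 10000   /* table size from the question */

/* Map a word to a slot index: compute djb2 (above), then reduce it
   to the table range with a modulo. */
size_t slot_for(char *word)
{
    return hash((unsigned char *)word) % TABLE_SIZE;
}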


Answer 2:

First, you generally do not want to use a cryptographic hash for a hash table. An algorithm that's very fast by cryptographic standards is still excruciatingly slow by hash table standards.

Second, you want to ensure that every bit of the input can/will affect the result. One easy way to do that is to rotate the current result by some number of bits, then XOR the current hash code with the current byte. Repeat until you reach the end of the string. Note that you generally do not want the rotation to be an even multiple of the byte size either.

For example, assuming the common case of 8 bit bytes, you might rotate by 5 bits:

/* Rotate x left by n bits (n must be in 1..31). */
static unsigned rol(unsigned x, unsigned n)
{
    return (x << n) | (x >> (32 - n));
}

unsigned hash(char const *input)
{
    unsigned result = 0x55555555;

    while (*input) {
        result ^= (unsigned char)*input++;
        result = rol(result, 5);   /* 5 is not a multiple of the byte size */
    }
    return result;
}

Edit: Also note that 10,000 slots is rarely a good choice for a hash table size. You usually want one of two things: either a prime number of slots (required to ensure correctness with some kinds of collision resolution) or a power of 2 (so reducing the value to the correct range can be done with a simple bit-mask).
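For illustration, a sketch of the two reduction styles (the sizes here are arbitrary examples, not taken from the answer; 8191 happens to be prime):

#define POW2_SIZE  8192u   /* power of 2: reduce with a cheap bit-mask */
#define PRIME_SIZE 8191u   /* prime: reduce with a modulo */

unsigned index_pow2(unsigned long h)  { return (unsigned)(h & (POW2_SIZE - 1)); }
unsigned index_prime(unsigned long h) { return (unsigned)(h % PRIME_SIZE); }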



Answer 3:

There are a number of existing hash table implementations for C, from the POSIX hcreate/hdestroy/hsearch functions to those in the APR and glib, which also provide prebuilt hash functions. I'd highly recommend using those rather than inventing your own hash table or hash function; they've been optimized heavily for common use cases.
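As a rough sketch of the POSIX <search.h> interface mentioned above (the key and value are made up for illustration, and error checking is omitted):

#include <search.h>   /* hcreate, hsearch, hdestroy (POSIX) */
#include <stdio.h>

int main(void)
{
    ENTRY item, *found;

    hcreate(16000);                 /* room for ~8,000 words, with slack */

    item.key  = "example";          /* keys must stay valid while the table lives */
    item.data = (void *)(long)42;   /* e.g. the word's position in the text file */
    hsearch(item, ENTER);

    item.key = "example";
    found = hsearch(item, FIND);
    if (found)
        printf("%s is at position %ld\n", found->key, (long)found->data);

    hdestroy();
    return 0;
}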

If your dataset is static, however, your best solution is probably to use a perfect hash. gperf will generate a perfect hash for you for a given dataset.



Answer 4:

Wikipedia shows a nice string hash function called Jenkins One At A Time Hash. It also quotes improved versions of this hash.

#include <stdint.h>   /* uint32_t */
#include <stddef.h>   /* size_t */

uint32_t jenkins_one_at_a_time_hash(char *key, size_t len)
{
    uint32_t hash, i;
    for(hash = i = 0; i < len; ++i)
    {
        hash += key[i];
        hash += (hash << 10);
        hash ^= (hash >> 6);
    }
    hash += (hash << 3);
    hash ^= (hash >> 11);
    hash += (hash << 15);
    return hash;
}


Answer 5:

First, are 40 collisions for 130 words hashed into 0..99 actually bad? You can't expect perfect hashing if you are not taking steps specifically to make it happen, and an ordinary hash function won't have fewer collisions than a random generator most of the time. In fact, assigning 130 keys uniformly at random to 100 buckets leaves only about 100 * (1 - 0.99^130), roughly 73, buckets occupied, i.e. about 57 of the words collide, so 40 collisions is already better than chance.

A hash function with a good reputation is MurmurHash3.

Finally, regarding the size of the hash table, it really depends on what kind of hash table you have in mind, in particular whether buckets are extensible or one-slot. If buckets are extensible, again there is a choice: you choose the average bucket length for the memory/speed constraints that you have.
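A small sketch of that trade-off (choose_buckets and the load-factor parameter are my own illustration, not from the answer):

#include <stddef.h>

/* Pick a bucket count from the expected item count and a target average
   bucket length (load factor), rounded up to a power of two so the index
   can be computed with a mask. */
size_t choose_buckets(size_t n_items, double target_load)
{
    size_t wanted = (size_t)(n_items / target_load);
    size_t buckets = 1;
    while (buckets < wanted)
        buckets <<= 1;
    return buckets;   /* 8,000 words with target_load = 1.0 gives 8,192 */
}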



Answer 6:

I have tried these hash functions and got the following results. I have about 960^3 entries, each 64 bytes long, 64 chars in a different order, 32-bit hash value. The code comes from here.

Hash function     | Collision rate | Time to finish
MurmurHash3       | 6.?%           | 4m15s
Jenkins One..     | 6.1%           | 6m54s
Bob, 1st in link  | 6.16%          | 5m34s
SuperFastHash     | 10%            | 4m58s
bernstein         | 20%            | 14s (only finished 1/20)
one_at_a_time     | 6.16%          | 7m5s
crc               | 6.16%          | 7m56s

One strange thing is that almost all the hash functions have a 6% collision rate for my data.



Answer 7:

Though djb2, as presented on stackoverflow by cnicutar, is almost certainly better, I think it's worth showing the K&R hashes too:

1) Apparently a terrible hash algorithm, as presented in K&R 1st edition (source)

unsigned long hash(unsigned char *str)
{
    unsigned long hash = 0;
    int c;

    while (c = *str++)
        hash += c;

    return hash;
}

2) Probably a pretty decent hash algorithm, as presented in K&R 2nd edition (verified by me on pg. 144 of the book); NB: be sure to remove % HASHSIZE from the return statement if you plan on doing the modulo reduction to your array length outside the hash algorithm. Also, I recommend making the return type and "hashval" unsigned long instead of plain unsigned (int).

unsigned hash(char *s)
{
    unsigned hashval;

    for (hashval = 0; *s != '\0'; s++)
        hashval = *s + 31*hashval;
    return hashval % HASHSIZE;
}

Note that it's clear from the two algorithms that one reason the 1st edition hash is so terrible is that it does NOT take string character order into account, so hash("ab") returns the same value as hash("ba"): both sum to 97 + 98 = 195. This is not so with the 2nd edition hash, which (much better!) returns two different values for those strings: 98 + 31*97 = 3105 for "ab" versus 97 + 31*98 = 3135 for "ba" (before the modulo).

The GCC C++11 hashing functions used for unordered_map (a hash table template) and unordered_set (a hash set template) appear to be as follows.

  • This is a partial answer to the question of which hash functions GCC C++11 uses, stating that GCC uses an implementation of "MurmurHashUnaligned2", by Austin Appleby (http://murmurhash.googlepages.com/).
  • In the file "gcc/libstdc++-v3/libsupc++/hash_bytes.cc", here (https://github.com/gcc-mirror/gcc/blob/master/libstdc++-v3/libsupc++/hash_bytes.cc), I found the implementations. Here's the one for the "32-bit size_t" return value, for example (pulled 11 Aug 2017):

Code:

// Implementation of Murmur hash for 32-bit size_t.
size_t _Hash_bytes(const void* ptr, size_t len, size_t seed)
{
  const size_t m = 0x5bd1e995;
  size_t hash = seed ^ len;
  const char* buf = static_cast<const char*>(ptr);

  // Mix 4 bytes at a time into the hash.
  while (len >= 4)
  {
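    // unaligned_load() is a helper defined earlier in hash_bytes.cc; it reads
    // the next 4 bytes of buf into a size_t without assuming alignment.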
    size_t k = unaligned_load(buf);
    k *= m;
    k ^= k >> 24;
    k *= m;
    hash *= m;
    hash ^= k;
    buf += 4;
    len -= 4;
  }

  // Handle the last few bytes of the input array.
  switch (len)
  {
    case 3:
      hash ^= static_cast<unsigned char>(buf[2]) << 16;
      [[gnu::fallthrough]];
    case 2:
      hash ^= static_cast<unsigned char>(buf[1]) << 8;
      [[gnu::fallthrough]];
    case 1:
      hash ^= static_cast<unsigned char>(buf[0]);
      hash *= m;
  };

  // Do a few final mixes of the hash.
  hash ^= hash >> 13;
  hash *= m;
  hash ^= hash >> 15;
  return hash;
}


Answer 8:

One thing I've used with good results is the following (I don't know if it's been mentioned already because I can't remember its name).

You precompute a table T with a random number for each character in your key's alphabet [0,255]. You hash your key 'k0 k1 k2 ... kN' by taking T[k0] xor T[k1] xor ... xor T[kN]. You can easily show that this is as random as your random number generator, and it's computationally very feasible. If you really run into a very bad instance with lots of collisions, you can just repeat the whole thing using a fresh batch of random numbers.
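A minimal sketch of that scheme (the names are mine, and the table is seeded with rand() purely for illustration; any decent random source works):

#include <stdlib.h>

static unsigned long T[256];   /* one random word per possible byte value */

/* Fill the table once; re-run with a new seed if a data set collides badly. */
void init_table(unsigned seed)
{
    srand(seed);
    for (int i = 0; i < 256; i++)
        T[i] = ((unsigned long)rand() << 16) ^ (unsigned long)rand();
}

/* Hash a key by XOR-ing the table entries of its bytes, as described above. */
unsigned long xor_table_hash(const unsigned char *key)
{
    unsigned long h = 0;
    while (*key)
        h ^= T[*key++];
    return h;
}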