I am building a hash table, where the key is a phone number (here are some of them):
6948060987
6960780800
6963208768
6944870406
6947279288
6953691771
6956094283
6947092062
6960086297
6947719197
6951516975
6957531584
6969211184
6963238579
6957054322
6952077216
6956907738
The number of entries will be 200, 2000, 20000 and 2000000 and the entries will be unique.
About the size of the table, I am following this answer.
I store the phone number as an array of chars. I noticed that all the numbers begin with 69, so I can skip those two digits in the hash function.
My attempt is to take the sum of the digits modulo the number of cells in the hash table, but on paper this looks like a bad function, since there are many collisions.
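To make the problem concrete, here is a minimal sketch of that digit-sum approach. The function name `digit_sum_hash` and the assumption that each key is a NUL-terminated string of ten ASCII digits are mine, for illustration only. With the leading 69 skipped, eight digits remain, so the sum can only range from 0 to 72: at most 73 cells are ever used, no matter how large the table is, and any two numbers whose digits are permutations of each other collide.

```c
#include <stddef.h>

/* Digit-sum hash as described above: skip the shared "69" prefix,
   add up the remaining digits, and reduce modulo the table size.
   Assumes `phone` is a NUL-terminated string of ASCII digits that
   is at least two characters long. */
size_t digit_sum_hash(const char *phone, size_t table_size)
{
    size_t sum = 0;
    for (const char *p = phone + 2; *p != '\0'; ++p)
        sum += (size_t)(*p - '0');   /* each digit contributes 0..9 */
    return sum % table_size;
}
```

For example, 6948060987 and 6984060987 differ only in the order of two digits, so both land in the same cell.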
How should I modify my hash function to get better results (fewer collisions)?
Why do you need a non-standard hash function at all?
There are plenty of hash functions which are well tested and have known properties. They work fine for any input, so they will also work well for phone numbers, which are after all a subset of ASCII strings. Is your application so time-critical that you need to design your own hash function and risk ending up with more collisions? If not, why not use one of the well-known hash functions?
For instance, if you need collision resistance that holds up against an adversary, use a cryptographic hash such as SHA-256 (truncated if you want). If you are not worried about an adversary, use something like universal hashing. Unless your problem is very specialised, you will be better off using someone else's well-tested hash algorithm than trying to invent one yourself.
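As a rough illustration of the universal hashing option, here is a sketch of the classic ((a * key + b) mod p) mod m family applied to the eight digits that follow the 69 prefix. All names here (`universal_hash`, `phone_to_key`, `universal_hash_index`, `UH_PRIME`) are hypothetical, and the sketch assumes ten-digit keys so that the parsed value stays below 10^8. The prime 4294967311 is the smallest prime above 2^32, which keeps the intermediate product inside 64 bits. How `a` and `b` are drawn at random is left to the caller.

```c
#include <stdint.h>

#define UH_PRIME 4294967311ULL   /* smallest prime above 2^32 */

/* One member of the hash family; a and b are picked at random once
   per table (by the caller) and then kept fixed. */
typedef struct {
    uint64_t a;   /* 1 <= a < UH_PRIME */
    uint64_t b;   /* 0 <= b < UH_PRIME */
    uint64_t m;   /* number of cells in the table */
} universal_hash;

/* Turn the digits after the shared "69" prefix into an integer.
   Assumes a NUL-terminated string of ten ASCII digits. */
uint64_t phone_to_key(const char *phone)
{
    uint64_t key = 0;
    for (const char *p = phone + 2; *p != '\0'; ++p)
        key = key * 10 + (uint64_t)(*p - '0');
    return key;
}

/* a < 2^33 and key < 2^27, so a * key + b cannot overflow 64 bits. */
uint64_t universal_hash_index(const universal_hash *h, uint64_t key)
{
    return ((h->a * key + h->b) % UH_PRIME) % h->m;
}
```

A caller would fill in a `universal_hash` once when creating the table, then compute `universal_hash_index(&h, phone_to_key(number))` for each lookup or insert.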
An even simpler hash is the original hash Perl used. In English, it takes the current hash value, multiplies it by 33, and adds the ASCII value of the next character. It's not a great hash, but it worked for Perl for a long while.
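The multiply-by-33 scheme described above could be sketched in C roughly like this. The function name `times33_hash`, the starting value of 0, and the final reduction modulo the table size are assumptions made for the example, not a reproduction of Perl's actual source.

```c
#include <stddef.h>

/* Multiply-by-33 string hash: for every character, multiply the
   running hash by 33 and add the character's value.
   Assumes `key` is a NUL-terminated string; the starting value 0
   is an assumption of this sketch. */
size_t times33_hash(const char *key, size_t table_size)
{
    unsigned long hash = 0;
    for (const unsigned char *p = (const unsigned char *)key; *p != '\0'; ++p)
        hash = hash * 33 + *p;   /* unsigned arithmetic wraps on overflow */
    return (size_t)(hash % table_size);
}
```

Unlike a plain digit sum, the result depends on the position of each character, so two phone numbers with the same digits in a different order generally hash to different values.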