I'm looking for an explanation of how a hash table works - in plain English for a simpleton like me!
For example, I know it takes the key, calculates the hash (I am looking for an explanation of how), and then performs some kind of modulo to work out where the value lies in the array it is stored in, but that's where my knowledge stops.
Could anyone clarify the process?
Edit: I'm not asking specifically about how hash codes are calculated, but a general overview of how a hash table works.
How the hash is computed usually does not depend on the hashtable, but on the items added to it. In frameworks/base class libraries such as .NET and Java, each object has a GetHashCode() (or similar) method returning a hash code for that object. The ideal hash code algorithm and the exact implementation depend on the data represented by the object.
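In Java, for example, the per-object method is hashCode(), overridden alongside equals() so that objects representing equal data return equal hash codes. The Person class below is just an illustrative sketch, not taken from any particular library:

```java
import java.util.Objects;

// Illustrative sketch: a class supplying its own hash code by combining
// the hash codes of the fields that also define equality.
class Person {
    final String name;
    final int age;

    Person(String name, int age) { this.name = name; this.age = age; }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof Person)) return false;
        Person p = (Person) o;
        return age == p.age && name.equals(p.name);
    }

    @Override
    public int hashCode() {
        return Objects.hash(name, age);   // equal objects return equal hash codes
    }
}
```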
This is how it works in my understanding:
Here's an example: picture the entire table as a series of buckets. Suppose you have an implementation with alpha-numeric hash-codes and have one bucket for each letter of the alphabet. This implementation puts each item whose hash code begins with a particular letter in the corresponding bucket.
Let's say you have 200 objects, but only 15 of them have hash codes that begin with the letter 'B.' The hash table would only need to look up and search through the 15 objects in the 'B' bucket, rather than all 200 objects.
As far as calculating the hash code, there is nothing magical about it. The goal is just to have different objects return different codes and for equal objects to return equal codes. You could write a class that always returns the same integer as a hash-code for all instances, but you would essentially destroy the usefulness of a hash-table, as it would just become one giant bucket.
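For illustration, such a degenerate (but perfectly legal) hash code would look something like this in Java:

```java
// Every instance reports the same hash code, so every item lands in the
// same bucket and lookups degrade to a linear search through one giant list.
class OneBigBucket {
    @Override
    public int hashCode() {
        return 42;   // same integer for all instances
    }
}
```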
All of the answers so far are good, and get at different aspects of how a hashtable works. Here is a simple example that might be helpful. Let's say we want to store some items with lowercase alphabetic strings as keys.
As simon explained, the hash function is used to map from a large space to a small space. A simple, naive implementation of a hash function for our example could take the first letter of the string, and map it to an integer, so "alligator" has a hash code of 0, "bee" has a hash code of 1, "zebra" would be 25, etc.
Next we have an array of 26 buckets (they could be ArrayLists in Java), and we put each item in the bucket that matches the hash code of its key. If we have more than one item whose key begins with the same letter, those items will have the same hash code and will all go in the bucket for that hash code, so a linear search has to be made within the bucket to find a particular item.
In our example, if we just had a few dozen items with keys spanning the alphabet, it would work very well. However, if we had a million items, or all the keys started with 'a' or 'b', then our hash table would not be ideal. To get better performance, we would need a different hash function and/or more buckets.
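A rough Java sketch of this first-letter scheme (the class and method names are invented for the example) might look like this:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;

// Naive first-letter hash table: 26 buckets, one per lowercase letter,
// each holding a list of (key, value) entries that is searched linearly.
class FirstLetterTable<V> {
    private final List<List<SimpleEntry<String, V>>> buckets = new ArrayList<>();

    FirstLetterTable() {
        for (int i = 0; i < 26; i++) buckets.add(new ArrayList<>());
    }

    private int hash(String key) {
        return key.charAt(0) - 'a';   // "alligator" -> 0, "bee" -> 1, "zebra" -> 25
    }

    void put(String key, V value) {
        buckets.get(hash(key)).add(new SimpleEntry<>(key, value));
    }

    V get(String key) {
        for (SimpleEntry<String, V> e : buckets.get(hash(key)))   // linear search within one bucket
            if (e.getKey().equals(key)) return e.getValue();
        return null;
    }
}
```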
Here's another way to look at it.
I assume you understand the concept of an array A. That's something that supports the operation of indexing, where you can get to the Ith element, A[I], in one step, no matter how large A is.
So, for example, if you want to store information about a group of people who all happen to have different ages, a simple way would be to have an array that is large enough, and use each person's age as an index into the array. That way, you could have one-step access to any person's information.
But of course there could be more than one person with the same age, so what you put in the array at each entry is a list of all the people who have that age. So you can get to an individual person's information in one step plus a little bit of search in that list (called a "bucket"). It only slows down if there are so many people that the buckets get big. Then you need a larger array, and some other way to get more identifying information about the person, like the first few letters of their surname, instead of using age.
That's the basic idea. Instead of using age, any function of the person that produces a good spread of values can be used. That's the hash function. Like you could take every third bit of the ASCII representation of the person's name, scrambled in some order. All that matters is that you don't want too many people to hash to the same bucket, because the speed depends on the buckets remaining small.
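A minimal Java sketch of the age-as-index idea (the names here are invented for the example):

```java
import java.util.ArrayList;
import java.util.List;

// The array index is the person's age; each slot is the "bucket" (list)
// of people who share that age.
class PeopleByAge {
    private final List<List<String>> byAge = new ArrayList<>();

    PeopleByAge(int maxAge) {
        for (int i = 0; i <= maxAge; i++) byAge.add(new ArrayList<>());
    }

    void add(String name, int age) {
        byAge.get(age).add(name);    // one-step jump to the right bucket
    }

    List<String> peopleAged(int age) {
        return byAge.get(age);       // then a short search within that bucket
    }
}
```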
This turns out to be a pretty deep area of theory, but the basic outline is simple.
Essentially, a hash function is just a function that takes things from one space (say strings of arbitrary length) and maps them to a space useful for indexing (unsigned integers, say).
If you only have a small space of things to hash, you might get away with just interpreting those things as integers, and you're done (e.g. 4-byte strings).
Usually, though, you've got a much larger space. If the space of things you allow as keys is bigger than the space of things you are using to index (your uint32's or whatever) then you can't possibly have a unique value for each one. When two or more things hash to the same result, you'll have to handle the redundancy in an appropriate way (this is usually referred to as a collision, and how you handle it or don't will depend a bit on what you are using the hash for).
This implies you want it to be unlikely for two keys to hash to the same result, and you probably also would really like the hash function to be fast.
Balancing these two properties (and a few others) has kept many people busy!
In practice you usually should be able to find a function that is known to work well for your application and use that.
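The small-key-space case mentioned above can be sketched in Java as follows (assuming keys are at most 4 ASCII characters; the class name is made up):

```java
// If a key is at most 4 ASCII characters, its bytes fit in a 32-bit int,
// so the bytes themselves can serve as the hash value - no collisions possible.
class TinyKeyHash {
    static int hash(String key) {
        if (key.length() > 4)
            throw new IllegalArgumentException("key too long for this trick");
        int h = 0;
        for (int i = 0; i < key.length(); i++)
            h = (h << 8) | (key.charAt(i) & 0xFF);   // pack each character's low byte
        return h;
    }
}
```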
Now to make this work as a hashtable: imagine you didn't care about memory usage. Then you can create an array as long as your indexing set (all uint32's, for example). As you add something to the table, you hash its key and look at the array at that index. If there is nothing there, you put your value there. If there is already something there, you add this new entry to a list of things at that address, along with enough information (your original key, or something clever) to find which entry actually belongs to which key.
So as you go along, every entry in your hashtable (the array) is either empty, contains one entry, or contains a list of entries. Retrieving is as simple as indexing into the array and either returning the value, or walking the list of values and returning the right one.
Of course, in practice you typically can't do this; it wastes too much memory. So you do everything based on a sparse array (where the only entries are the ones you actually use, and everything else is implicitly null).
There are lots of schemes and tricks to make this work better, but that's the basics.
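Putting that together, a compact sketch of such an array-of-lists table in Java (a toy for illustration, not a production implementation) could look like this:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;

// Chained hash table: hash the key, index into a (much smaller) array,
// and keep a list of (key, value) pairs at each slot to absorb collisions.
class ChainedTable<K, V> {
    private final List<SimpleEntry<K, V>>[] slots;

    @SuppressWarnings("unchecked")
    ChainedTable(int capacity) {
        slots = new List[capacity];                 // far smaller than "all uint32s"
        for (int i = 0; i < capacity; i++) slots[i] = new ArrayList<>();
    }

    private int index(K key) {
        return Math.floorMod(key.hashCode(), slots.length);   // hash, then fold into range
    }

    void put(K key, V value) {
        for (SimpleEntry<K, V> e : slots[index(key)])
            if (e.getKey().equals(key)) { e.setValue(value); return; }   // key already present
        slots[index(key)].add(new SimpleEntry<>(key, value));            // otherwise append
    }

    V get(K key) {
        for (SimpleEntry<K, V> e : slots[index(key)])   // walk the slot's short list
            if (e.getKey().equals(key)) return e.getValue();
        return null;
    }
}
```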
Lots of answers, but none of them are very visual, and hash tables can easily "click" when visualised.
Hash tables are often implemented as arrays of linked lists. If we imagine a table storing people's names, after a few insertions it might be laid out in memory as below, where ()-enclosed numbers are hash values of the text/name:

```
bucket#   bucket contents / linked list

 [0]      null
 [1]      null
 [2]      --> "fred"(42) --> "..."(9282) --> "jane"(42) --> null
 [3]      ...
 ...
 [9]      null
```

A few points:

- each of the array entries (indices [0], [1], ...) is known as a bucket, and starts a - possibly empty - linked list of values (aka elements, in this example - people's names)
- each value (e.g. "fred" with hash 42) is linked from bucket [hash % number_of_buckets], e.g. 42 % 10 == [2]; % is the modulus operator - the remainder when divided by the number of buckets
- several values may be linked from the same bucket, usually because their hash values coincide after the modulus operation (e.g. 42 % 10 == [2], and 9282 % 10 == [2]), but occasionally because the hash values are the same (e.g. "fred" and "jane" both shown with hash 42 above)

If the table size grows, hash tables implemented as above tend to resize themselves (i.e. create a bigger array of buckets, create new/updated linked lists there-from, delete the old array) to keep the ratio of elements to buckets (aka load factor) somewhere in the 0.5 to 1.0 range. With load factor 1 and a cryptographic-strength hash function, 36.8% of buckets will tend to be empty, another 36.8% have one element, 18.4% two elements, 6.1% three elements, 1.5% four elements, 0.3% five, etc. - the list lengths average 2.0 elements no matter how many elements are in the table (i.e. whether there are 100 elements and 100 buckets, or 100 million elements and 100 million buckets), which is why we say lookup/insert/erase are O(1) constant-time operations.
(Notes: not all hash tables use linked lists, but most general purpose ones do, as closed hashing (aka open addressing) - particularly with erase operations supported - has less stable performance properties with collision-prone keys/hash functions).
A few words on hash functions
A general purpose, worst-case collision-minimising hash function's job is to spray the keys around the hash table buckets effectively at random, while always generating the same hash value for the same key. Even one bit changing anywhere in the key would ideally - randomly - flip about half the bits in the resultant hash value.
This is normally orchestrated with maths too complicated for me to grok. I'll mention one easy-to-understand way - not the most scalable or cache friendly but inherently elegant (like encryption with a one-time pad!) - as I think it helps drive home the desirable qualities mentioned above. Say you were hashing 64-bit doubles - you could create 8 tables each of 256 random numbers (i.e. size_t random[8][256]), then use each 8-bit/1-byte slice of the double's memory representation to index into a different table, XORing the random numbers you look up. With this approach, it's easy to see that a bit changing anywhere in the double results in a different random number being looked up in one of the tables, and a totally uncorrelated final value.

Still, many libraries' hashing functions pass integers through unchanged, which is extremely collision prone in the worst cases, but the hope is that in the fairly common case of integer keys that tend to be incrementing, they'll map into successive buckets, leaving fewer empty than the 36.8% that random hashing leaves, thereby having fewer collisions and fewer long linked lists of colliding elements than is achieved by random mappings. It's also great to save the time it takes to generate a strong hash. When the keys don't increment nicely, the hope is they'll be random enough that they won't need a strong hash function to totally randomise their placement into buckets.
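A Java sketch of that table-of-random-numbers scheme (tabulation-style hashing; the class name is invented for the example):

```java
import java.util.Random;

// 8 tables of 256 random numbers, one table per byte of the double's
// 64-bit representation; the looked-up numbers are XORed together, so
// flipping any input bit swaps in a completely different random value.
class TabulationHash {
    private final long[][] table = new long[8][256];

    TabulationHash(long seed) {
        Random rng = new Random(seed);
        for (int i = 0; i < 8; i++)
            for (int j = 0; j < 256; j++)
                table[i][j] = rng.nextLong();
    }

    long hash(double d) {
        long bits = Double.doubleToLongBits(d);              // raw 64-bit representation
        long h = 0;
        for (int i = 0; i < 8; i++) {
            int slice = (int) ((bits >>> (8 * i)) & 0xFF);   // i-th byte of the double
            h ^= table[i][slice];                            // XOR the random number it indexes
        }
        return h;
    }
}
```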
Well, that was less fun and heavier going than the hash table explanation, but hope it helps someone....