I am trying to get my head around the details of a HAMT (hash array mapped trie). I've implemented one myself in Java just to understand it. I am familiar with tries and I think I get the main concept of the HAMT.
Basically, there are two types of nodes:

Key/Value node:

- K key
- V value

Index node:

- int bitmap (32 bits)
- Node[] table (max length of 32)
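In Java, those two node shapes might look roughly like this (a minimal sketch; the class and field names are mine, not from the paper):

```java
// Marker interface for the two node kinds (names are illustrative).
interface Node<K, V> { }

// Leaf node: holds a single key/value pair.
class KeyValueNode<K, V> implements Node<K, V> {
    final K key;
    final V value;
    KeyValueNode(K key, V value) { this.key = key; this.value = value; }
}

// Internal node: a 32-bit bitmap plus a compressed child table.
// Invariant: table.length == Integer.bitCount(bitmap), so at most 32.
class IndexNode<K, V> implements Node<K, V> {
    final int bitmap;
    final Node<K, V>[] table;
    IndexNode(int bitmap, Node<K, V>[] table) {
        this.bitmap = bitmap;
        this.table = table;
    }
}
```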
- Generate a 32-bit hash for an object.
- Step through the hash 5 bits at a time.
(0-4, 5-9, 10-14, 15-19, 20-24, 25-29, 30-31)
note: the last step (7th) is only 2 bits.
- At each step, treat those 5 bits as an integer (0-31) giving a bit position in the bitmap, e.g.
integer==5 means bit 5 of the bitmap (bitmap==100000 in binary has that bit set)
- If the bit is a 1 then an entry for that part of the hash exists.
- If the bit is a 0 then the key doesn't exist.
- If the bit is set, find the entry's index into the table by counting the number of 1s in the bitmap below that position, e.g.
integer==6 bitmap==0101010101 index==3
- If the table entry is a key/value node, compare its key to the search key.
- If the table entry is an index node, go back to step 2, moving ahead one 5-bit step (the sketch below puts this together).
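Put together, the lookup walk could be sketched like this (using the node classes above; hash extension and an empty-root check are omitted):

```java
// Sketch of a lookup: consume the hash 5 bits at a time, using
// popcount on the bitmap to index the compressed child table.
static <K, V> V lookup(Node<K, V> root, K key) {
    int hash = key.hashCode();
    int shift = 0;
    Node<K, V> node = root; // assumes a non-null root
    while (node instanceof IndexNode) {
        IndexNode<K, V> idx = (IndexNode<K, V>) node;
        int bit = 1 << ((hash >>> shift) & 0x1F);   // next 5 bits -> bit position
        if ((idx.bitmap & bit) == 0) {
            return null;                            // bit clear: key absent
        }
        int pos = Integer.bitCount(idx.bitmap & (bit - 1)); // 1s below our bit
        node = idx.table[pos];
        shift += 5; // a full version would extend the hash once shift >= 32
    }
    KeyValueNode<K, V> leaf = (KeyValueNode<K, V>) node;
    return leaf.key.equals(key) ? leaf.value : null;
}
```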
The part I don't quite understand is collision detection and mitigation. In the linked paper, Bagwell alludes to it:
The existing key is then inserted in the new sub-hash table and the
new key added. Each time 5 more bits of the hash are used the
probability of a collision reduces by a factor of 1/32. Occasionally
an entire 32 bit hash may be consumed and a new one must be computed
to differentiate the two keys.
If I were to compute a "new" hash and store the object at that new hash, how would you ever be able to look up the object in the structure? When doing a look-up, wouldn't it generate the "initial" hash and not the "re-computed" hash?
I must be missing something.....
BTW: The HAMT performs fairly well; it sits between a hash map and a tree map in my tests.
| Data Structure       | Add time  | Remove time | Sorted add time | Sorted remove time | Lookup time | Size     |
|----------------------|-----------|-------------|-----------------|--------------------|-------------|----------|
| Java's Hash Map      | 38.67 ms  | 18 ms       | 30 ms           | 15 ms              | 16.33 ms    | 8.19 MB  |
| HAMT                 | 68.67 ms  | 30 ms       | 57.33 ms        | 35.33 ms           | 33 ms       | 10.18 MB |
| Java's Tree Map      | 86.33 ms  | 78 ms       | 75 ms           | 39.33 ms           | 76 ms       | 8.79 MB  |
| Java's Skip List Map | 111.33 ms | 106 ms      | 72 ms           | 41 ms              | 72.33 ms    | 9.19 MB  |
There are two sections of the paper I think you might have missed. The first is the bit immediately preceding the one you quoted:
Or the key will collide with an existing one. In which case the existing key
must be replaced with a sub-hash table and the next 5 bit hash of the existing key
computed. If there is still a collision then this process is repeated until no collision
occurs.
So if you have object A in the table and you add object B which clashes, the cell at which their keys clashed becomes a pointer to another sub-table (where they don't clash).
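Here's a hedged sketch of that step in Java, reusing the node classes from the question (a real insert has more cases; this only shows how two clashing leaves get pushed down a level):

```java
// When an insert lands on an occupied leaf, push both keys one level
// deeper, repeating while their next 5 hash bits still clash.
static <K, V> Node<K, V> merge(KeyValueNode<K, V> a, int hashA,
                               KeyValueNode<K, V> b, int hashB,
                               int shift) {
    int fragA = (hashA >>> shift) & 0x1F; // next 5 bits of each hash
    int fragB = (hashB >>> shift) & 0x1F;
    if (fragA == fragB) {
        // Still colliding on these 5 bits: wrap a one-entry index node
        // around a merge one level deeper. (A full version would extend
        // the hash once shift reaches 32, per section 3.7.)
        return new IndexNode<>(1 << fragA,
                new Node[] { merge(a, hashA, b, hashB, shift + 5) });
    }
    // The 5-bit fragments differ: both leaves fit in one new index node,
    // stored in bit-position order (lower fragment first).
    Node[] table = fragA < fragB ? new Node[] { a, b } : new Node[] { b, a };
    return new IndexNode<>((1 << fragA) | (1 << fragB), table);
}
```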
Next, Section 3.7 of the paper you linked describes the method for generating a new hash when you run off the end of your first 32 bits:
The hash function was tailored to give a 32 bit hash. The algorithm requires that
the hash can be extended to an arbitrary number of bits. This was accomplished by
rehashing the key combined with an integer representing the trie level, zero being
the root. Hence if two keys do give the same initial hash then the rehash has a
probability of 1 in 2^32 of a further collision.
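Concretely, that could look something like the following in Java. The mixing steps here are my own stand-in (murmur3-style constants); the paper only requires that the key combined with the level is rehashed:

```java
// Sketch of a level-aware hash, per section 3.7. Level 0 is the root
// hash; higher levels rehash the key combined with the level so two keys
// whose first 32 bits collide almost certainly diverge afterwards.
static int extendedHash(Object key, int level) {
    if (level == 0) return key.hashCode();
    int h = key.hashCode() ^ (level * 0x9E3779B9); // combine key and level
    h ^= h >>> 16;                                 // cheap avalanche mixing
    h *= 0x85EBCA6B;
    h ^= h >>> 13;
    return h;
}
```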
If this doesn't seem to explain anything, say so and I'll extend this answer with more detail.
HAMT is a great and highly performant structure, especially when one needs immutable objects, i.e. where every modification creates a new copy of the data structure!
As for your question on hash collisions, I have found a C# implementation (which is buggy now) that shows how it works: on each hash collision the key is rehashed and the lookup is retried recursively until a maximum iteration limit is reached.
Currently I am also exploring HAMT in a functional programming context and learning from existing code. There are several reference implementations of HAMT, in Haskell as Data.HashMap and in Clojure as PersistentHashMap.
There are some other simpler implementations on the web that do not deal with collisions, but they are useful to understand the concept. Here they are in Haskell and OCaml
I have found a nice summary article that describes HAMT with pictures and links to the original research papers by Phil Bagwell.
Related points:
While implementing HAMT in F# I noticed that the popCount function implementation described here really matters, giving a 10-15% speedup compared to the naive implementation described in the next answers at that link. Not great, but a free lunch.
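For reference, the trick in question is presumably the standard SWAR popcount; here is a Java version (this is essentially what Integer.bitCount does, and the JVM intrinsifies it to a hardware POPCNT where available):

```java
// Classic branch-free SWAR popcount for a 32-bit word.
static int popCount(int i) {
    i = i - ((i >>> 1) & 0x55555555);                // sum bits in pairs
    i = (i & 0x33333333) + ((i >>> 2) & 0x33333333); // sum pairs into nibbles
    i = (i + (i >>> 4)) & 0x0F0F0F0F;                // sum nibbles into bytes
    return (i * 0x01010101) >>> 24;                  // add bytes, take top byte
}
```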
The related IntMap structures (Haskell and its port to F#) are very good when the key can be an integer; they implement the related PATRICIA/radix trie.
I believe all these implementations are very good for learning efficient immutable data structures and functional languages in all their beauty - they really shine together!
If I were to compute a "new" hash and store the object at that new hash, how would you ever be able to look up the object in the structure? When doing a look-up, wouldn't it generate the "initial" hash and not the "re-computed" hash?
When doing a look-up, the initial hash is used. When the bits of the initial hash are exhausted, one of the following conditions is true:
- we end up with a key/value node - return it
- we end up with an index node - this is the hint that we have to go
deeper by recomputing a new hash.
The key here is hash-bit exhaustion.
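In code, the exhaustion check might sit in the lookup loop like this (a sketch; extendedHash is the hypothetical level-keyed hash from the earlier answer, and the node classes are from the sketches above):

```java
// Lookup that recomputes the hash each time 32 bits are consumed.
static <K, V> V lookup(Node<K, V> root, K key) {
    int level = 0;                          // full hashes consumed so far
    int hash = extendedHash(key, level);
    int shift = 0;
    Node<K, V> node = root;                 // assumes a non-null root
    while (node instanceof IndexNode) {
        if (shift >= 32) {                  // bits exhausted: go deeper
            hash = extendedHash(key, ++level);
            shift = 0;
        }
        IndexNode<K, V> idx = (IndexNode<K, V>) node;
        int bit = 1 << ((hash >>> shift) & 0x1F);
        if ((idx.bitmap & bit) == 0) return null;
        node = idx.table[Integer.bitCount(idx.bitmap & (bit - 1))];
        shift += 5;
    }
    KeyValueNode<K, V> leaf = (KeyValueNode<K, V>) node;
    return leaf.key.equals(key) ? leaf.value : null;
}
```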
The chance of collision is presumably very low, and generally only problematic for huge trees. Given this, you're better off just storing collisions in an array at the leaf and searching it linearly (I do this in my C# HAMT).
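A minimal sketch of that leaf-array approach, in Java rather than C# (the class name and layout are mine):

```java
// Collision node: all entries share the same full 32-bit hash, so we
// keep them side by side and scan linearly on lookup.
class CollisionNode<K, V> implements Node<K, V> {
    final int hash;       // the hash every key in this node shares
    final K[] keys;
    final V[] values;
    CollisionNode(int hash, K[] keys, V[] values) {
        this.hash = hash;
        this.keys = keys;
        this.values = values;
    }
    V find(K key) {
        for (int i = 0; i < keys.length; i++) {
            if (keys[i].equals(key)) return values[i];
        }
        return null;
    }
}
```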