C++ unordered_map collision handling, resize and rehash
This follows a previous question I opened, and I have realised that I have a lot of confusion about how unordered_map is implemented. I am sure many other people share that confusion with me. Based on the information I have, without reading the standard:
Every unordered_map implementation stores a linked list to external nodes in the array of buckets... No, that is not at all the most efficient way to implement a hash map for most common uses. Unfortunately, a small "oversight" in the specification of unordered_map all but requires this behavior. The required behavior is that iterators to elements must stay valid when inserting or deleting other elements.
I was hoping that someone might explain the implementation and how it fits with the C++ standard definition (in terms of performance requirements), and, if it really is not the most efficient way to implement a hash map data structure, how it could be improved.
The Standard effectively mandates std::unordered_set and std::unordered_map implementations that use open hashing, which means an array of buckets, each of which holds the head of a logical (and typically actual) list. That requirement is subtle: it is a consequence of the default max load factor being 1.0 and the guarantee that the table will not be rehashed unless grown beyond that load factor. That would be impractical without chaining, as collisions with closed hashing become overwhelming as the load factor approaches 1.

(To allow optimal iteration without passing over any empty buckets, GCC's implementation fills the buckets with iterators into a single singly-linked list holding all the values: the iterators point to the element immediately before that bucket's elements, so the next pointer there can be rewired if erasing the bucket's last value.)
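You can observe the chaining and load-factor behaviour through the standard bucket interface (bucket_count(), load_factor(), max_load_factor(), bucket(), bucket_size()). A small sketch; the exact bucket counts printed will vary between implementations:

    #include <iostream>
    #include <string>
    #include <unordered_map>

    int main() {
        std::unordered_map<int, std::string> m;

        // The default max load factor is 1.0: the table is only rehashed once
        // size() would exceed bucket_count() * max_load_factor().
        std::cout << "max_load_factor: " << m.max_load_factor() << '\n';

        for (int i = 0; i < 100; ++i)
            m[i] = "value " + std::to_string(i);

        std::cout << "size: " << m.size()
                  << ", buckets: " << m.bucket_count()
                  << ", load_factor: " << m.load_factor() << '\n';

        // Every key hashes to exactly one bucket; keys that collide are chained
        // inside that bucket, so bucket_size() can exceed 1 without a rehash.
        auto b = m.bucket(42);
        std::cout << "key 42 is in bucket " << b
                  << " together with " << m.bucket_size(b) - 1
                  << " other element(s)\n";
    }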
Regarding the text you quote:
There is no "oversight"... what was done was very deliberate and done with full awareness. It is true that other compromises could have been struck, but the open hashing / chaining approach is a reasonable compromise for general use: it copes reasonably elegantly with collisions from mediocre hash functions, is not too wasteful with small or large key/value types, and handles arbitrarily many insert/erase pairs without gradually degrading performance the way many closed hashing implementations do.

As evidence of the awareness, from Matthew Austern's proposal here:
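A small sketch of what node-based chaining buys in practice: the standard guarantees that references and pointers to elements are not invalidated by inserting or erasing other elements, even across the rehashes those insertions trigger (iterators, by contrast, are invalidated by a rehash).

    #include <cassert>
    #include <string>
    #include <unordered_map>

    int main() {
        std::unordered_map<int, std::string> m;
        m[1] = "one";

        std::string* p = &m[1];           // pointer to a stored mapped value

        // Inserting many other elements triggers several rehashes, yet pointers
        // and references to existing elements stay valid: the values live in
        // separately allocated nodes chained from the bucket array, and they
        // never move even when the bucket array itself is reallocated.
        for (int i = 2; i < 10000; ++i)
            m[i] = std::to_string(i);

        assert(p == &m.find(1)->second);  // still the same node
        assert(*p == "one");
    }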
Specifically for insert-only tables with data small enough to store directly in the buckets, a convenient sentinel value for unused buckets, and a good hash function, a closed hashing approach may be roughly an order of magnitude faster and use dramatically less memory, but that's not general purpose.
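To make that alternative concrete, below is a deliberately minimal, hypothetical closed hashing (open addressing) set: insert-only, values stored directly in the slot array, -1 used as the sentinel for unused buckets, and linear probing for collisions. It only illustrates the trade-off described above; it is not how any standard library implements these containers.

    #include <cstdint>
    #include <functional>
    #include <iostream>
    #include <vector>

    constexpr std::int64_t kEmpty = -1;        // sentinel for unused buckets

    // Illustrative only: an insert-only closed-hashing set of non-negative
    // integers, grown before the load factor reaches 0.5, since closed hashing
    // degrades badly as the table fills up.
    class ProbingSet {
    public:
        explicit ProbingSet(std::size_t capacity = 16) : slots_(capacity, kEmpty) {}

        bool insert(std::int64_t key) {
            if ((size_ + 1) * 2 > slots_.size())
                grow();
            if (insert_no_grow(slots_, key)) {
                ++size_;
                return true;
            }
            return false;                      // key was already present
        }

        bool contains(std::int64_t key) const {
            std::size_t i = home_bucket(slots_, key);
            while (slots_[i] != kEmpty) {
                if (slots_[i] == key) return true;
                i = (i + 1) % slots_.size();   // probe the next slot
            }
            return false;
        }

    private:
        static std::size_t home_bucket(const std::vector<std::int64_t>& slots,
                                       std::int64_t key) {
            return std::hash<std::int64_t>{}(key) % slots.size();
        }

        static bool insert_no_grow(std::vector<std::int64_t>& slots,
                                   std::int64_t key) {
            std::size_t i = home_bucket(slots, key);
            while (slots[i] != kEmpty) {
                if (slots[i] == key) return false;
                i = (i + 1) % slots.size();
            }
            slots[i] = key;
            return true;
        }

        void grow() {
            std::vector<std::int64_t> bigger(slots_.size() * 2, kEmpty);
            for (std::int64_t k : slots_)
                if (k != kEmpty) insert_no_grow(bigger, k);
            slots_.swap(bigger);
        }

        std::vector<std::int64_t> slots_;
        std::size_t size_ = 0;
    };

    int main() {
        ProbingSet s;
        for (int i = 0; i < 1000; ++i) s.insert(i * 7);
        std::cout << s.contains(49) << ' ' << s.contains(50) << '\n';  // prints "1 0"
    }

Note how the elements sit directly in the slot array: no per-element allocation and good cache locality, but also no stable addresses across growth and no graceful handling of erase, which is exactly why it suits the narrow insert-only case rather than general use.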
A full comparison and elaboration of hash table design options and their implications is off topic for S.O. as it's way too broad to address properly here.