Given a static set of objects (static in the sense that once loaded it seldom if ever changes) that needs repeated concurrent lookups with optimal performance, which is better: a HashMap or an array with a binary search using some custom comparator?
Is the answer a function of object or struct type? Hash and/or Equals function performance? Hash uniqueness? List size? HashSet size / set size?
The size of the set that I'm looking at can be anywhere from 500k to 10M, in case that information is useful.
While I'm looking for a C# answer, I think the true mathematical answer lies not in the language, so I'm not including that tag. However, if there are C# specific things to be aware of, that information is desired.
For very small collections the difference is going to be negligible. At the low end of your range (500k items) you will start to see a difference if you're doing lots of lookups. A binary search is going to be O(log n), whereas a hash lookup will be O(1), amortized. That's not the same as truly constant, but you would still have to have a pretty terrible hash function to get worse performance than a binary search.
(When I say "terrible hash", I mean something like:
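a hash that maps every key to the same value; a hypothetical C# sketch:

```csharp
// Hypothetical "terrible" hash: it runs in constant time, but every key
// lands in the same bucket, so lookups degrade into a linear scan.
public override int GetHashCode()
{
    return 0;
}
```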
Yeah, it's blazing fast itself, but causes your hash map to become a linked list.)
ialiashkevich wrote some C# code using an array and a Dictionary to compare the two methods, but it used Long values for keys. I wanted to test something that would actually execute a hash function during the lookup, so I modified that code. I changed it to use String values, and I refactored the populate and lookup sections into their own methods so it's easier to see in a profiler. I also left in the code that used Long values, just as a point of comparison. Finally, I got rid of the custom binary search function and used the one in the Array class. Here's that code:
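Roughly, it looks like this (a sketch: the key format, collection sizes, and method names here are hypothetical, and only the String-key path is shown):

```csharp
// Sketch of the benchmark: populate a sorted string array and a Dictionary
// with the same keys, then time repeated lookups with Stopwatch.
using System;
using System.Collections.Generic;
using System.Diagnostics;

class LookupBenchmark
{
    static string[] PopulateArray(int size)
    {
        var arr = new string[size];
        for (int i = 0; i < size; i++)
            arr[i] = "key_" + i.ToString("D10");
        Array.Sort(arr, StringComparer.Ordinal);   // binary search needs sorted input
        return arr;
    }

    static Dictionary<string, int> PopulateDictionary(int size)
    {
        var dict = new Dictionary<string, int>(size);
        for (int i = 0; i < size; i++)
            dict["key_" + i.ToString("D10")] = i;
        return dict;
    }

    static long TimeArrayLookups(string[] arr, int lookups)
    {
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < lookups; i++)
            Array.BinarySearch(arr, "key_" + (i % arr.Length).ToString("D10"), StringComparer.Ordinal);
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    static long TimeDictionaryLookups(Dictionary<string, int> dict, int lookups)
    {
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < lookups; i++)
            dict.TryGetValue("key_" + (i % dict.Count).ToString("D10"), out _);
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    static void Main()
    {
        foreach (int size in new[] { 1000, 100000, 1000000, 10000000 })
        {
            var arr = PopulateArray(size);
            var dict = PopulateDictionary(size);
            Console.WriteLine($"{size}: binary search {TimeArrayLookups(arr, size)} ms, " +
                              $"dictionary {TimeDictionaryLookups(dict, size)} ms");
        }
    }
}
```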
Here are the results with several different sizes of collections. (Times are in milliseconds.)
And for comparison, here's the profiler output for the last run of the program (10 million records and lookups). I highlighted the relevant functions. They pretty closely agree with the Stopwatch timing metrics above.
You can see that the Dictionary lookups are much faster than binary search, and (as expected) the difference is more pronounced the larger the collection. So, if you have a reasonable hashing function (fairly quick with few collisions), a hash lookup should beat binary search for collections in this range.
I'd say it depends mainly on the performance of the hash and compare methods. For example, when using string keys that are very long but random, a comparison will typically bail out after the first few differing characters, whereas a default hash function will process the entire string.
But in most cases the hash map should be faster.
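One way to keep the hashing cost down for long keys, as a hypothetical sketch (the prefix length and mixing constants are arbitrary choices, not anything built into .NET), is to hash only a fixed-length prefix while keeping full-string equality:

```csharp
// Hypothetical IEqualityComparer<string> that hashes only the first few
// characters, so very long keys don't pay for a full-string hash.
// Equality still compares whole strings, so correctness is preserved;
// the risk is more collisions if keys share long common prefixes.
using System;
using System.Collections.Generic;

sealed class PrefixHashComparer : IEqualityComparer<string>
{
    private readonly int _prefixLength;

    public PrefixHashComparer(int prefixLength) => _prefixLength = prefixLength;

    public bool Equals(string x, string y) => string.Equals(x, y, StringComparison.Ordinal);

    public int GetHashCode(string s)
    {
        if (s == null) return 0;
        int n = Math.Min(_prefixLength, s.Length);
        int hash = 17;
        for (int i = 0; i < n; i++)
            hash = hash * 31 + s[i];  // unchecked overflow is fine for a hash
        return hash;
    }
}

// Usage (hypothetical): var dict = new Dictionary<string, int>(new PrefixHashComparer(16));
```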
Here it's described how hash tables are built. Because the universe of keys is reasonably big and hash functions are built to be "very injective", so that collisions rarely happen, the access time for a hash table is not actually O(1); it is based on probabilities. Still, it is reasonable to say that the access time of a hash lookup is almost always less than O(log2(n)).
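To put rough numbers on that (a back-of-the-envelope sketch, assuming a chained hash table with load factor alpha, where a successful lookup touches about 1 + alpha/2 entries on average):

```csharp
// Back-of-the-envelope comparison: binary search cost grows with log2(n),
// while the expected cost of a chained hash lookup stays near 1 + alpha/2
// for a fixed load factor alpha (assumption: uniform hashing).
using System;

class ExpectedCost
{
    static void Main()
    {
        double alpha = 1.0; // assumed load factor (entries per bucket on average)
        foreach (int n in new[] { 500000, 1000000, 10000000 })
        {
            double binarySearchComparisons = Math.Log(n, 2);
            double expectedHashProbes = 1 + alpha / 2;
            Console.WriteLine($"n={n}: ~{binarySearchComparisons:F1} comparisons vs ~{expectedHashProbes:F1} probes");
        }
    }
}
```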
This is more of a comment on Bill's answer, but his answer has so many upvotes even though it's wrong that I had to post this.
I see lots of discussion about the worst-case complexity of a lookup in a hash table, and about what counts as amortized analysis and what doesn't. Please check the link below:
Hash table runtime complexity (insert, search and delete)
The worst-case complexity is O(n), not O(1), contrary to what Bill says. His O(1) complexity is therefore not amortized, since that analysis can only be used for worst cases (his own Wikipedia link says so as well).
https://en.wikipedia.org/wiki/Hash_table
https://en.wikipedia.org/wiki/Amortized_analysis