Given a static set of objects (static in the sense that, once loaded, it seldom if ever changes) into which repeated concurrent lookups are needed with optimal performance, which is better: a HashMap or an array with a binary search using some custom comparator?
Is the answer a function of object or struct type? Hash and/or Equals function performance? Hash uniqueness? List size? HashSet size/set size?
The size of the set that I'm looking at can be anywhere from 500k to 10m, in case that information is useful.
While I'm looking for a C# answer, I think the true mathematical answer lies not in the language, so I'm not including that tag. However, if there are C# specific things to be aware of, that information is desired.
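For concreteness, the kind of comparison I have in mind looks roughly like the sketch below, with long keys standing in for my real objects and comparer (the set size, probe count, and key type are placeholders, not a claim about which option wins):

```csharp
// Micro-benchmark sketch: Dictionary lookup vs. binary search over a sorted
// array. Key type (long) and sizes are placeholders; measure with the real
// objects, comparer, and access pattern before drawing conclusions.
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

class LookupBenchmark
{
    static void Main()
    {
        const int n = 1_000_000;          // somewhere in the 500k-10m range
        var rng = new Random(42);
        long[] keys = Enumerable.Range(0, n).Select(i => (long)i * 7).ToArray();

        // Option 1: hash-based lookup.
        var dict = keys.ToDictionary(k => k, k => k);

        // Option 2: sorted array + binary search with an explicit comparer.
        long[] sorted = (long[])keys.Clone();
        Array.Sort(sorted);
        var cmp = Comparer<long>.Default;

        long[] probes = Enumerable.Range(0, 1_000_000)
                                  .Select(_ => keys[rng.Next(n)]).ToArray();

        var sw = Stopwatch.StartNew();
        long hits = 0;
        foreach (long p in probes)
            if (dict.ContainsKey(p)) hits++;
        Console.WriteLine($"Dictionary:   {sw.ElapsedMilliseconds} ms ({hits} hits)");

        sw.Restart();
        hits = 0;
        foreach (long p in probes)
            if (Array.BinarySearch(sorted, p, cmp) >= 0) hits++;
        Console.WriteLine($"BinarySearch: {sw.ElapsedMilliseconds} ms ({hits} hits)");
    }
}
```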
If your set of objects is truly static and unchanging, you can use a perfect hash to get O(1) performance guaranteed. I've seen gperf mentioned a few times, though I've never had occasion to use it myself.
Hashes are typically faster, although binary searches have better worst-case characteristics. A hash access is typically a calculation to get a hash value to determine which "bucket" a record will be in, and so the performance will generally depend on how evenly the records are distributed, and the method used to search the bucket. A bad hash function (leaving a few buckets with a whole lot of records) with a linear search through the buckets will result in a slow search. (On the third hand, if you're reading a disk rather than memory, the hash buckets are likely to be contiguous while the binary tree pretty much guarantees non-local access.)
If you want generally fast, use the hash. If you really want guaranteed bounded performance, you might go with the binary tree.
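To make the bucket mechanics concrete, here is a rough sketch of a chained hash lookup; the bucket count and hashing details are arbitrary choices, the point is the "hash once to pick a bucket, then scan the bucket" pattern described above:

```csharp
// Sketch of the "compute bucket, then search the bucket" access pattern,
// using simple chaining. The bucket count is an arbitrary assumption.
using System;
using System.Collections.Generic;

class ChainedHashSet<T>
{
    private readonly List<T>[] buckets;
    private readonly IEqualityComparer<T> cmp;

    public ChainedHashSet(IEnumerable<T> items, int bucketCount = 1 << 20,
                          IEqualityComparer<T> comparer = null)
    {
        cmp = comparer ?? EqualityComparer<T>.Default;
        buckets = new List<T>[bucketCount];
        foreach (T item in items)
        {
            int b = Bucket(item);
            (buckets[b] ??= new List<T>()).Add(item);
        }
    }

    private int Bucket(T item) =>
        (cmp.GetHashCode(item) & 0x7fffffff) % buckets.Length;

    public bool Contains(T item)
    {
        // Step 1: hash once to pick the bucket.
        List<T> bucket = buckets[Bucket(item)];
        if (bucket == null) return false;

        // Step 2: linear search inside the bucket. With a good hash this is
        // a comparison or two; with a bad hash it degrades toward O(n).
        foreach (T candidate in bucket)
            if (cmp.Equals(candidate, item)) return true;
        return false;
    }
}
```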
I wonder why no one mentioned perfect hashing.
It's only relevant if your dataset is fixed for a long time, but what it does is analyze the data and construct a perfect hash function that ensures no collisions.
Pretty neat, if your data set is constant and the time to calculate the function is small compared to the application run time.
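As a toy illustration of the idea (not of gperf itself, and only practical for small key sets; real generators use much smarter constructions to handle millions of keys), you can brute-force a seed that makes a simple hash collision-free over a fixed key set, after which every lookup is a single probe:

```csharp
// Toy illustration of perfect hashing: for a fixed key set, search for a seed
// that makes a simple multiplicative hash collision-free, then look up with a
// single probe and no collision handling. Only feasible for small sets.
using System;

class ToyPerfectHash
{
    private string[] table;
    private int seed;

    public ToyPerfectHash(string[] keys)            // keys assumed distinct
    {
        int size = keys.Length * 4;                 // slack makes a seed easier to find
        table = new string[size];
        for (seed = 1; ; seed++)                    // brute-force seed search
        {
            Array.Clear(table, 0, size);
            bool collision = false;
            foreach (string k in keys)
            {
                int slot = Hash(k) % size;
                if (table[slot] != null) { collision = true; break; }
                table[slot] = k;
            }
            if (!collision) return;                 // every key has its own slot
        }
    }

    private int Hash(string key)
    {
        unchecked
        {
            int h = seed;
            foreach (char c in key) h = h * 31 + c;
            return h & 0x7fffffff;
        }
    }

    public bool Contains(string key) => table[Hash(key) % table.Length] == key;
}
```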
Surprised nobody mentioned Cuckoo hashing, which provides guaranteed O(1) lookup and, unlike perfect hashing, is capable of using all of the memory it allocates, whereas perfect hashing can end up with guaranteed O(1) lookup while wasting the greater portion of its allocation. The caveat? Insertion time can be very slow, especially as the number of elements increases, since all of the optimization is performed during the insertion phase.
I believe some version of this is used in router hardware for IP lookups.
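A minimal sketch of the scheme, assuming string keys and simple seeded hash functions: a lookup touches at most two slots, while insertion does the displacement work (and occasionally a full rebuild):

```csharp
// Sketch of cuckoo hashing for strings: two tables, two hash functions, at
// most two probes per lookup. Insertion "kicks out" residents and can be
// slow or trigger a rebuild, which is the trade-off described above.
using System;
using System.Collections.Generic;

class CuckooSet
{
    private string[] t1, t2;
    private int seed1 = 17, seed2 = 31;
    private const int MaxKicks = 500;

    public CuckooSet(int capacity = 16)
    {
        t1 = new string[capacity];
        t2 = new string[capacity];
    }

    private int H(string key, int seed)
    {
        unchecked
        {
            int h = seed;
            foreach (char c in key) h = h * 131 + c;
            return (h & 0x7fffffff) % t1.Length;
        }
    }

    // Lookup is guaranteed O(1): the key can only live in one of two slots.
    public bool Contains(string key) =>
        t1[H(key, seed1)] == key || t2[H(key, seed2)] == key;

    public void Add(string key)
    {
        if (Contains(key)) return;
        string cur = key;
        for (int kicks = 0; kicks < MaxKicks; kicks++)
        {
            // Try table 1; if the slot is occupied, evict its resident.
            int i1 = H(cur, seed1);
            (cur, t1[i1]) = (t1[i1], cur);
            if (cur == null) return;

            // The evicted item goes to its slot in table 2, possibly evicting again.
            int i2 = H(cur, seed2);
            (cur, t2[i2]) = (t2[i2], cur);
            if (cur == null) return;
        }
        Rebuild(cur);   // too many kicks: grow, change seeds, reinsert everything
    }

    private void Rebuild(string pending)
    {
        var all = new List<string> { pending };
        foreach (string s in t1) if (s != null) all.Add(s);
        foreach (string s in t2) if (s != null) all.Add(s);
        t1 = new string[t1.Length * 2];
        t2 = new string[t2.Length * 2];
        seed1 += 2; seed2 += 2;             // change the hash functions as well
        foreach (string s in all) Add(s);
    }
}
```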
Of course, hash is fastest for such a big dataset.
One way to speed it up even more, since the data seldom changes, is to programmatically generate ad-hoc code to do the first layer of search as a giant switch statement (if your compiler can handle it), and then branch off to search the resulting bucket.
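Roughly, the generated code might take a shape like the following sketch, with the first character standing in for whatever first-layer dispatch a real generator would derive from the actual data:

```csharp
// Hand-written shape of what the generated lookup might look like: a switch
// on a cheap first-layer key (the first character, purely as an example) that
// jumps straight to a small pre-sorted bucket and finishes with a binary
// search. A real generator would emit one case per bucket from the data set.
using System;

static class GeneratedLookup
{
    // These buckets would be emitted by the generator, one per switch case.
    static readonly string[] BucketA = { "aardvark", "apple", "azure" };
    static readonly string[] BucketB = { "banana", "bear", "bottle" };

    public static bool Contains(string key)
    {
        if (string.IsNullOrEmpty(key)) return false;
        switch (key[0])                   // first layer: constant-time dispatch
        {
            case 'a': return Array.BinarySearch(BucketA, key, StringComparer.Ordinal) >= 0;
            case 'b': return Array.BinarySearch(BucketB, key, StringComparer.Ordinal) >= 0;
            // ...one case per bucket, emitted from the real data set...
            default:  return false;
        }
    }
}
```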
The answer depends. Let's assume that the number of elements 'n' is very large. If you can write a good hash function with few collisions, then hashing is the best. Note that the hash function is executed only once per lookup, and it directs you to the corresponding bucket, so it is not a big overhead even when n is high.
The problem with hash tables: if the hash function is not good (collisions are frequent), then the search isn't O(1). It tends toward O(n), because searching within a bucket is a linear search, and that can be worse than a binary tree.
The problem with binary trees: if the tree isn't balanced, it also tends toward O(n). For example, if you insert 1, 2, 3, 4, 5 into a naive binary tree, it ends up looking more like a list.
So, if you can come up with a good hashing methodology, use a hash table; if not, you are better off with a (balanced) binary tree.
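To see the unbalanced-tree failure mode concretely, here is a small sketch: inserting already-sorted keys into a naive binary search tree makes the depth grow to n rather than log n.

```csharp
// Sketch of the degenerate case above: inserting already-sorted keys into a
// naive, unbalanced binary search tree produces a list-like tree, so its
// depth (and hence lookup cost) grows as O(n) instead of O(log n).
using System;

class NaiveBstDemo
{
    private class Node { public int Key; public Node Left, Right; }

    private static Node Insert(Node node, int key)
    {
        if (node == null) return new Node { Key = key };
        if (key < node.Key) node.Left = Insert(node.Left, key);
        else node.Right = Insert(node.Right, key);
        return node;
    }

    private static int Depth(Node node) =>
        node == null ? 0 : 1 + Math.Max(Depth(node.Left), Depth(node.Right));

    static void Main()
    {
        Node root = null;
        for (int i = 1; i <= 2000; i++)      // sorted insertions: 1, 2, 3, ...
            root = Insert(root, i);

        // Depth equals the element count: the "tree" is effectively a list.
        Console.WriteLine($"n = 2000, depth = {Depth(root)}, log2(n) ~ {Math.Log(2000, 2):F1}");
    }
}
```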