It's clear that the search performance of the generic HashSet<T> class is higher than that of the generic List<T> class: just compare the hash-based lookup with the linear approach in the List<T> class.
However, calculating a hash key may itself take some CPU cycles, so for a small number of items the linear search can be a real alternative to the HashSet<T>.
My question: where is the break-even?
To simplify the scenario (and to be fair), let's assume that the List<T> class uses the element's Equals() method to identify an item.
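For concreteness, here is a minimal sketch of the two lookups being compared (the string items are just placeholder data):

```csharp
using System;
using System.Collections.Generic;

class Program
{
    static void Main()
    {
        var list = new List<string> { "alpha", "beta", "gamma" };
        var set = new HashSet<string> { "alpha", "beta", "gamma" };

        // List<T>.Contains walks the list, calling Equals() on each element: O(n).
        Console.WriteLine(list.Contains("beta"));  // True

        // HashSet<T>.Contains computes GetHashCode() once, then probes a bucket: O(1).
        Console.WriteLine(set.Contains("beta"));   // True
    }
}
```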
The break-even will depend on the cost of computing the hash. Hash computations can be trivial, or not... :-) There is always the System.Collections.Specialized.HybridDictionary class to help you not have to worry about the break-even point.
You're looking at this wrong. Yes, a linear search of a List will beat a HashSet for a small number of items. But the performance difference usually doesn't matter for collections that small. It's generally the large collections you have to worry about, and that's where you think in terms of Big-O. However, if you've measured a real bottleneck on HashSet performance, then you can try to create a hybrid List/HashSet, but you'll do that by conducting lots of empirical performance tests - not by asking questions on SO.
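The hybrid List/HashSet mentioned above can be sketched as a wrapper that stays on a List<T> while small and switches to a HashSet<T> past a threshold. The threshold of 10 here is an arbitrary assumption; the whole point is that you'd pick it by measuring:

```csharp
using System;
using System.Collections.Generic;

// Sketch of the hybrid idea: linear List<T> while small, HashSet<T> once large.
// Threshold is an assumption - find your own break-even empirically.
class HybridSet<T>
{
    private const int Threshold = 10;
    private List<T> _list = new List<T>();
    private HashSet<T> _set;  // null until we cross the threshold

    public void Add(T item)
    {
        if (_set != null) { _set.Add(item); return; }
        if (!_list.Contains(item)) _list.Add(item);
        if (_list.Count > Threshold)
        {
            _set = new HashSet<T>(_list);  // one-time migration
            _list = null;
        }
    }

    public bool Contains(T item) =>
        _set != null ? _set.Contains(item) : _list.Contains(item);
}
```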
It's essentially pointless to compare two structures for performance that behave differently. Use the structure that conveys the intent. Even if you say your List<T> wouldn't have duplicates and iteration order doesn't matter, making it comparable to a HashSet<T>, it's still a poor choice to use List<T> because it's relatively less fault tolerant.

That said, I will inspect some other aspects of performance:

* Even though addition is O(1) in both cases, it will be relatively slower in HashSet<T> since it involves the cost of precomputing the hash code before storing it.
* The superior scalability of HashSet<T> has a memory cost: every entry is stored as a new object along with its hash code.

This article might give you an idea.
You can use a HybridDictionary, which automatically detects the breaking point and accepts null values, making it essentially the same as a HashSet.
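A quick illustration of the class the answer refers to (note it is non-generic, so keys and values are `object`):

```csharp
using System;
using System.Collections.Specialized;

class Program
{
    static void Main()
    {
        // HybridDictionary starts out backed by a ListDictionary (linear scan)
        // and switches internally to a Hashtable once the collection grows.
        var d = new HybridDictionary();
        d["one"] = 1;
        d["two"] = null;   // null values are allowed; null keys are not

        Console.WriteLine(d.Contains("one"));  // True
        Console.WriteLine(d.Count);            // 2
    }
}
```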
One factor you're not taking into account is the robustness of the GetHashCode() function. With a perfect hash function, the HashSet will clearly have better search performance. But as the quality of the hash function diminishes, so will the HashSet search time.
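To see the degradation in the extreme, here is a deliberately worst-case key type (a hypothetical example): every instance hashes to the same bucket, so the HashSet degenerates into a linear scan over Equals() calls:

```csharp
using System;
using System.Collections.Generic;

// Worst-case hash: all instances collide into a single bucket,
// so HashSet<BadKey>.Contains becomes effectively O(n).
class BadKey
{
    public int Value;
    public override int GetHashCode() => 0;  // constant hash defeats bucketing
    public override bool Equals(object o) => o is BadKey b && b.Value == Value;
}

class Program
{
    static void Main()
    {
        var set = new HashSet<BadKey>();
        for (int i = 0; i < 1000; i++)
            set.Add(new BadKey { Value = i });

        // Correctness is preserved, but every lookup probes one giant bucket.
        Console.WriteLine(set.Contains(new BadKey { Value = 999 }));  // True
    }
}
```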
A lot of people are saying that once you get to the size where speed is actually a concern, HashSet<T> will always beat List<T>, but that depends on what you are doing.

Let's say you have a List<T> that will only ever have on average 5 items in it. Over a large number of cycles, if a single item is added or removed each cycle, you may well be better off using a List<T>.

I did a test for this on my machine, and, well, it has to be very, very small to get an advantage from List<T>. For a list of short strings, the advantage went away after size 5; for objects, after size 20.

Here is that data displayed as a graph:

Here's the code:
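The author's code and graph aren't reproduced in this copy of the answer. A minimal sketch of such a benchmark, with the sizes, iteration count, and Stopwatch-based timing all being assumptions rather than the original setup, might look like:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

class Program
{
    static void Main()
    {
        const int iterations = 1_000_000;
        foreach (int size in new[] { 2, 5, 10, 20, 50 })
        {
            var items = Enumerable.Range(0, size).Select(i => "item" + i).ToArray();
            var list = new List<string>(items);
            var set = new HashSet<string>(items);
            string probe = items[size - 1];  // last item: worst case for linear scan

            var sw = Stopwatch.StartNew();
            for (int i = 0; i < iterations; i++) list.Contains(probe);
            long listMs = sw.ElapsedMilliseconds;

            sw.Restart();
            for (int i = 0; i < iterations; i++) set.Contains(probe);
            long setMs = sw.ElapsedMilliseconds;

            Console.WriteLine($"n={size,3}: List {listMs} ms, HashSet {setMs} ms");
        }
    }
}
```

For results you would actually trust, run a release build and repeat each measurement several times; a single Stopwatch pass like this is only a rough sketch.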