Let's say that I need to make a mapping from String to an integer. The integers are unique and form a continuous range starting from 0. That is:
Hello -> 0
World -> 1
Foo -> 2
Bar -> 3
Spam -> 4
Eggs -> 5
etc.
There are at least two straightforward ways to do it. With a hashmap:
HashMap<String, Integer> map = ...
int integer = map.get(string); // Plus maybe null check to avoid NPE in unboxing.
Or with a list:
List<String> list = ...
int integer = list.indexOf(string); // Plus maybe check for -1.
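To make the two variants concrete, here is a minimal self-contained sketch. The class name LookupVariants, the helper names and the choice of -1 for unknown strings are only for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class LookupVariants {
    static final List<String> KEYS =
            Arrays.asList("Hello", "World", "Foo", "Bar", "Spam", "Eggs");

    // Variant 1: the map stores each string's position explicitly.
    static Map<String, Integer> buildMap(List<String> keys) {
        Map<String, Integer> map = new HashMap<>();
        for (int i = 0; i < keys.size(); i++) {
            map.put(keys.get(i), i);
        }
        return map;
    }

    // Returns -1 for unknown strings instead of failing when unboxing null.
    static int viaMap(Map<String, Integer> map, String s) {
        Integer value = map.get(s);
        return value == null ? -1 : value;
    }

    // Variant 2: the position in the list is the integer.
    static int viaList(List<String> list, String s) {
        return list.indexOf(s);
    }

    public static void main(String[] args) {
        Map<String, Integer> map = buildMap(KEYS);
        System.out.println(viaMap(map, "Spam") + " " + viaList(KEYS, "Spam")); // prints "4 4"
    }
}
```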
Which approach should I use, and why? Arguably the relative performance depends on the size of the list/map, since List#indexOf() is a linear search using String#equals() -> O(n) efficiency, while HashMap#get() uses the hash to narrow down the search -> certainly more efficient when the map is big, but maybe inferior when there are just a few elements (there must be some overhead in calculating the hash, right?).
Since benchmarking Java code properly is notoriously hard, I would like to get some educated guesses. Is my reasoning above correct (list is better for small, map is better for large)? What is the threshold size approximately? What difference do various List and HashMap implementations make?
Your reasoning is correct on all points: HashMaps are better (they use a hash).
But at the end of the day, you're just going to have to benchmark your particular application. I don't see why HashMaps would be slower for small cases, but benchmarking will tell you whether they are.
One more option: a TreeMap is another map data structure, which uses a tree as opposed to a hash to access the entries. If you are benchmarking anyway, you might as well benchmark that too.
Regarding benchmarking, one of the main problems is the garbage collector. However, if your test doesn't allocate any objects, that shouldn't be a problem. Fill up your map/list, then write a loop that gets N random elements and time it; that should be reasonably reproducible and therefore informative.
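The loop described above could look roughly like the sketch below. Everything in it is an assumption made for illustration: the class name LookupTiming, the key naming scheme, the sizes and the probe count. It only measures hits, and because String caches its hash code, repeated probes skip the hash computation after the first lookup, which flatters the map. Treat the numbers as a rough indication; a harness such as JMH is the more reliable route.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

class LookupTiming {
    static long sink; // written after each run so the JIT cannot discard the lookups

    // Keys "key0", "key1", ...; the naming scheme is arbitrary.
    static List<String> buildKeys(int n) {
        List<String> keys = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            keys.add("key" + i);
        }
        return keys;
    }

    static Map<String, Integer> buildMap(List<String> keys) {
        Map<String, Integer> map = new HashMap<>();
        for (int i = 0; i < keys.size(); i++) {
            map.put(keys.get(i), i);
        }
        return map;
    }

    // Picks the probes up front so the timed loops allocate nothing.
    // Copies are used so equals() cannot succeed on reference identity alone.
    static String[] randomProbes(List<String> keys, int count, long seed) {
        Random random = new Random(seed);
        String[] probes = new String[count];
        for (int i = 0; i < count; i++) {
            probes[i] = new String(keys.get(random.nextInt(keys.size())));
        }
        return probes;
    }

    // Returns the elapsed nanoseconds for looking up every probe in the list.
    static long timeList(List<String> keys, String[] probes) {
        long checksum = 0;
        long start = System.nanoTime();
        for (String probe : probes) {
            checksum += keys.indexOf(probe);
        }
        long elapsed = System.nanoTime() - start;
        sink = checksum;
        return elapsed;
    }

    // Returns the elapsed nanoseconds for looking up every probe in the map.
    static long timeMap(Map<String, Integer> map, String[] probes) {
        long checksum = 0;
        long start = System.nanoTime();
        for (String probe : probes) {
            checksum += map.get(probe);
        }
        long elapsed = System.nanoTime() - start;
        sink = checksum;
        return elapsed;
    }

    public static void main(String[] args) {
        for (int n : new int[] {2, 5, 10, 20, 100, 1000}) {
            List<String> keys = buildKeys(n);
            Map<String, Integer> map = buildMap(keys);
            String[] probes = randomProbes(keys, 100_000, 42L);
            for (int warmup = 0; warmup < 5; warmup++) { // give the JIT a chance to compile both loops
                timeList(keys, probes);
                timeMap(map, probes);
            }
            System.out.printf("n=%d list=%d ns map=%d ns%n",
                    n, timeList(keys, probes), timeMap(map, probes));
        }
    }
}
```

To include misses, add some strings that are not in the key set to the probes and handle the null that map.get then returns.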
You're right: a List lookup would be O(n) and a HashMap lookup would be O(1), so a HashMap would be faster once n is large enough that the linear search costs more than calculating the hash.
I don't know the threshold size; that's a matter for experimentation or better analytics than I can muster right now.
I think a HashMap will always be better. If you have n strings, each of length at most l, then String#hashCode and String#equals are both O(l) (in Java's default implementation, anyway).
When you do List#indexOf, it iterates through the list (O(n)) and performs a comparison on each element (O(l)), giving O(nl) performance.
Java's
HashMap has (let's say) r buckets, and each bucket contains a linked list. Each of these lists is of length O(n/r) (assuming the String's hashCode method distributes the Strings uniformly between the buckets). To look up a String, you need to calculate the hashCode (O(l)), look up the bucket (O(1) - one, not l), and iterate through that bucket's linked list (O(n/r) elements), doing an O(l) comparison on each one. This gives a total lookup time of O(l + (nl)/r).
As the List implementation is O(nl) and the HashMap implementation is O(nl/r) (I'm dropping the first l as it's relatively insignificant), lookup performance should be equivalent when r=1, and the HashMap will be faster for all greater values of r.
Note that you can set
r when you construct the HashMap using the HashMap(int initialCapacity, float loadFactor) constructor (set the initialCapacity to r and the loadFactor argument to n/r for your given n and chosen r).
Unfortunately, you are going to have to benchmark this yourself, because the relative performance will depend critically on the actual String values, and also on the relative probability that you will test a string that is not in your mapping. And of course, it depends on how
String.equals() and String.hashCode() are implemented, as well as the details of the HashMap and List classes used.
In the case of a HashMap, a lookup will typically involve calculating the hash of the key String, and then comparing the key String with one or more entry key Strings. The hashcode calculation looks at all characters of the String, so its cost depends on the length of the key String. The equals operation will typically examine all of the characters when it returns true and considerably fewer when it returns false. The actual number of times that equals is called for a given key string depends on how the hashed key strings are distributed. Normally, you'd expect an average of 1 or 2 calls to equals for a "hit" and maybe up to 3 for a "miss".
In the case of a
List, a lookup will call equals for an average of half the entry key Strings in the case of a "hit" and all of them in the case of a "miss". If you know the relative distribution of the keys that you are looking up, you can improve the performance in the "hit" case by ordering the list. But the "miss" case cannot be optimized.
In addition to the trie alternative suggested by @aioobe, you could also implement a specialized String to integer hashmap using a so-called perfect hash function. This maps each of the actual key strings to a unique hash within a small range. The hash can then be used to index an array of key/value pairs. This reduces a lookup to exactly one call to the hash function and one call to String.equals. (And if you can assume that the supplied key will always be one of the mapped strings, you can dispense with the call to equals.)
The difficulty of the perfect hash approach is in finding a function that works for the set of keys in the mapping and is not too expensive to compute. AFAIK, this has to be done by trial and error.
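As a rough illustration of the trial-and-error search, the sketch below tries odd multipliers for a simple polynomial string hash until all keys land in distinct slots of a table twice the size of the key set. The class name PerfectHashSketch, the hash formula, the table size and the search limit are all assumptions chosen for brevity, and there is no guarantee the search succeeds for every key set.

```java
class PerfectHashSketch {
    private final int multiplier;
    private final String[] slotKeys;  // key stored in each slot, or null if the slot is empty
    private final int[] slotValues;   // integer associated with the key in the same slot

    private PerfectHashSketch(int multiplier, String[] slotKeys, int[] slotValues) {
        this.multiplier = multiplier;
        this.slotKeys = slotKeys;
        this.slotValues = slotValues;
    }

    // Polynomial hash with a configurable multiplier (illustrative, not tuned).
    static int hash(String s, int multiplier) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = h * multiplier + s.charAt(i);
        }
        return h ^ (h >>> 16);
    }

    // keys[i] maps to i. Tries multipliers until no two keys share a slot.
    static PerfectHashSketch build(String[] keys) {
        int size = Math.max(1, 2 * keys.length);
        for (int multiplier = 3; multiplier < 10_000; multiplier += 2) {
            String[] slotKeys = new String[size];
            int[] slotValues = new int[size];
            boolean collision = false;
            for (int i = 0; i < keys.length && !collision; i++) {
                int slot = Math.floorMod(hash(keys[i], multiplier), size);
                if (slotKeys[slot] != null) {
                    collision = true;
                } else {
                    slotKeys[slot] = keys[i];
                    slotValues[slot] = i;
                }
            }
            if (!collision) {
                return new PerfectHashSketch(multiplier, slotKeys, slotValues);
            }
        }
        throw new IllegalStateException("no collision-free multiplier found");
    }

    // One hash computation and one equals call; returns -1 for unknown strings.
    int lookup(String s) {
        int slot = Math.floorMod(hash(s, multiplier), slotKeys.length);
        return s.equals(slotKeys[slot]) ? slotValues[slot] : -1;
    }

    public static void main(String[] args) {
        PerfectHashSketch index =
                build(new String[] {"Hello", "World", "Foo", "Bar", "Spam", "Eggs"});
        System.out.println(index.lookup("Spam"));
    }
}
```

Whether this actually beats a plain HashMap would still need measuring, since the custom hash is recomputed on every lookup whereas String caches its own hashCode.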
But the reality is that simply using a HashMap is a safe option, because it gives O(1) performance with a relatively small constant of proportionality (unless the entry keys are pathological).
(FWIW, my guess is that the break-even point where HashMap.get() becomes better than List.contains() is less than 10 entries, assuming that the strings have an average length of 5 to 10.)
A third option, and possibly my favorite, would be to use a trie:
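A minimal sketch of such a trie, mapping each string to its integer, might look like this. The class name StringIndexTrie is made up, and the per-node HashMap is only there to keep the example short; an implementation aimed at speed would more likely use arrays or some other compact child representation.

```java
import java.util.HashMap;
import java.util.Map;

class StringIndexTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        int value = -1; // -1 means that no stored string ends at this node
    }

    private final Node root = new Node();

    // Walks (and creates) one node per character, then stores the value at the end.
    void put(String key, int value) {
        Node node = root;
        for (int i = 0; i < key.length(); i++) {
            node = node.children.computeIfAbsent(key.charAt(i), c -> new Node());
        }
        node.value = value;
    }

    // Returns the stored integer, or -1 if the string was never added.
    int get(String key) {
        Node node = root;
        for (int i = 0; i < key.length(); i++) {
            node = node.children.get(key.charAt(i));
            if (node == null) {
                return -1;
            }
        }
        return node.value;
    }
}
```

A lookup touches at most one node per character of the key, and a miss can stop at the first character that has no matching child.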
I bet it beats the HashMap in performance (no collisions + the fact that computing the hash-code is O(length of string) anyway), and possibly also the List approach in some cases (such as if your strings have long common prefixes, as the indexOf would waste a lot of time in the equals methods).
When choosing between List and Map I would go for a
Map (such as HashMap). Here is my reasoning:
Readability
The Map interface simply provides a more intuitive interface for this use case.
Optimization in the right place
I'd say if you're using a List you would be optimizing for the small cases anyway. That's probably not where the bottleneck is.
A fourth option would be to use a LinkedHashMap, iterate through it if the size is small, and get the associated number if the size is large.
A fifth option is to encapsulate the decision in a separate class altogether. In this case you could even implement it to change strategy at runtime as the list grows.
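The fifth option could be sketched along these lines. The class name AdaptiveStringIndex and the caller-supplied threshold are assumptions; the right threshold, if there is one, would have to come from measurement.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class AdaptiveStringIndex {
    private final int threshold;
    private final List<String> keys = new ArrayList<>();
    private Map<String, Integer> map; // stays null while the list is still small

    AdaptiveStringIndex(int threshold) {
        this.threshold = threshold;
    }

    // Appends a new string and returns its integer; known strings keep theirs.
    int add(String key) {
        int existing = indexOf(key);
        if (existing >= 0) {
            return existing;
        }
        int index = keys.size();
        keys.add(key);
        if (map != null) {
            map.put(key, index);
        } else if (keys.size() > threshold) {
            // Switch strategy: build the map once from everything added so far.
            map = new HashMap<>();
            for (int i = 0; i < keys.size(); i++) {
                map.put(keys.get(i), i);
            }
        }
        return index;
    }

    // Linear search while small, hash lookup once the map exists; -1 if unknown.
    int indexOf(String key) {
        if (map == null) {
            return keys.indexOf(key);
        }
        Integer value = map.get(key);
        return value == null ? -1 : value;
    }

    boolean usesMap() {
        return map != null;
    }
}
```

Callers only see add and indexOf, so the strategy (or the threshold) can change later without touching the calling code.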
From what I can remember, the list method will be O(n), but it would be quick to add items, as no computation occurs. You could lower this to O(log n) if you kept the list sorted and implemented a binary search or another search algorithm. The hash is O(1), but it's slower to insert, since the hash needs to be computed every time you add an element.
I know that in .NET there's a special collection called HybridDictionary that does exactly this: it uses a list up to a point, then a hash. I think the crossover is around 10, so this may be a good line in the sand.
I would say you're correct in your above statement, though I'm not 100% sure if a list would be faster for small sets, and where the crossover point is.
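The binary-search idea mentioned above might be sketched like this. Because sorting destroys the original positions, the sketch keeps a parallel array that remembers each string's integer. The class name SortedStringIndex is an assumption, and each comparison is still proportional to the string length, so a lookup is closer to O(l log n) than O(log n).

```java
import java.util.Arrays;

class SortedStringIndex {
    private final String[] sortedKeys;
    private final int[] values; // values[i] is the integer for sortedKeys[i]

    // keysInIndexOrder[i] maps to i, as in the question.
    SortedStringIndex(String[] keysInIndexOrder) {
        int n = keysInIndexOrder.length;
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
        }
        // Sort the positions by the string they refer to.
        Arrays.sort(order, (a, b) -> keysInIndexOrder[a].compareTo(keysInIndexOrder[b]));
        sortedKeys = new String[n];
        values = new int[n];
        for (int i = 0; i < n; i++) {
            sortedKeys[i] = keysInIndexOrder[order[i]];
            values[i] = order[i];
        }
    }

    // Binary search over the sorted keys; returns -1 for unknown strings.
    int indexOf(String key) {
        int pos = Arrays.binarySearch(sortedKeys, key);
        return pos >= 0 ? values[pos] : -1;
    }
}
```

Adding a string later means inserting into the sorted arrays, so this fits best when the key set is built once and then only read.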