I have a large collection of Strings. I want to be able to find the Strings that begin with "Foo" or the Strings that end with "Bar". What would be the best Collection type to get the fastest results? (I am using Java)
I know that a HashSet is very fast for complete matches, but not for partial matches I would think? So, what could I use instead of just looping through a List? Should I look into LinkedList's or similar types? Are there any Collection Types that are optimized for this kind of queries?
If the list of words is stable (not many words are added or deleted), a very good second alternative is to create 2 lists:
For speed purposes, make them
ArrayList
s. NeverLinkedList
s or other variants which perform extremely bad on random access (the core of binary search; see below).After the lists are created, they can be sorted with method
Collections.sort
(only once each) and then searched withCollections.binarySearch
. For example:And then to search for words starting in "Foo":
And words ending in "Bar":
The best collection type for this problem is
SortedSet
. You would need two of them in fact:Once these
SortedSet
s have been created, you can use methodsubSet
to find what you are looking for. For example:Words starting with
"Foo"
:Words ending with
"Bar"
:The reason we are "adding" 1 to the last search character is to obtain the whole range. The "ending" word is excluded from the
subSet
, so there is no problem.EDIT: Of the two concrete classes that implement
SortedSet
in the standard Java library, useTreeSet
. The other (ConcurrentSkipListSet
) is oriented to concurrent programs and thus not optimized for this situation.It's been a while but I needed to implement this now and did some testing.
I already have a
HashSet<String>
as source so generation of all other datastructures is included in search time. 100 different sources are used and each time the data structures need to be regenerated. I only need to match a few single Strings each time. These tests ran on Android.Methods:
Simple loop through
HashSet
and callendsWith()
on each stringSimple loop through
HashSet
and perform precompiledPattern
match (regex) on each string.Convert
HashSet
to singleString
joined by\n
and single match on whole String.Generate
SortedTree
with reversedStrings
fromHashSet
. Then match withsubset()
as explained by @Mario Rossi.Results:
Conclusion:
SortedSet
/SortedTree
is extremely fast in searching. Much faster than just looping through allStrings
. However, creating the structure takes a lot of time. Regexes are much slower, but generating a single largeString
out of hundreds ofStrings
is more of a bottleneck on Android/Java.If only a few matches need to be made, then you better loop through your collection. If you have much more matches to make it may be very useful to use a
SortedTree
!