Say I have a Set of Strings:
Set<String> things = new HashSet<String>();
things.add("coffee cup");
things.add("smartphone");
things.add("inkjet printer");
// :
// list could be quite large (100K or so, perhaps loaded from a database)
// :
Now I want to check if another string completely contains any of the Strings in the above set. So:
"a coffee cup" - matches
"android smartphone" - matches
"inkjet printer for sale" - matches
"laser printer" - does not match
"printer" - does not match
The only way I can think of is iterating through the set (and break-ing if found). Is there a more efficient and elegant way to do this?
I built off of @meriton's suggestion. Instead of every possible substring combination, I'm going to do every possible word combination.
If I run "inkjet printer for sale" through the code snippet above, I get:
Then I can do a simple
contains()
on the original set of words.Iterate over all substrings of the candidate, and check whether the set contains them?
Yes, a string of length k has k^2 substrings, but that may still be far less than the number of strings in the set ...
You need Aho-Corasick algorithm. http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm
https://github.com/raymanrt/aho-corasick
Time complexity is O(m) for preprocessing (where m is total length of strings in the set) and O(n) for matching (where n is length of matched string). So it's asymptotically optimal.