I have a large set of Words and Phrases (a lexicon or dictionary) which includes wildcards. I need to find all instances of those Words and Phrases within a much smaller String (~150 characters at the moment).
Initially, I wanted to run the operation in reverse; that is, to check whether each word in my smaller string exists within the Lexicon, which could be implemented as a Hash Table. The problem is that some of the values in my Lexicon are not single words and many are wildcards (e.g. substri*).
I'm thinking of using the Rabin-Karp algorithm but I'm not sure this is the best choice.
What's an efficient algorithm or method to perform this operation?
Sample Data:
The dictionary contains hundreds of words and can potentially expand. These words may end with wildcard characters (asterisks). Here are some random examples:
- good
- bad
- freed*
- careless*
- great loss
The text we are analyzing (at this point) consists of short, informal (grammar-wise) English statements. The prime example (again, at this point in time) would be a Twitter tweet. These are limited to roughly 140 characters. For example:
Just got the Google nexus without a contract. Hands down its the best phone
I've ever had and the only thing that could've followed my N900.
While it may be helpful to note that we are performing very simple Sentiment Analysis on this text, our Sentiment Analysis technique is not my concern. I'm simply migrating an existing solution to a "real-time" processing system and need to perform some optimizations.
I think this is an excellent use case for the Aho-Corasick string-matching algorithm, which is specifically designed to find all matches of a large set of strings in a single string. It runs in two phases: a first phase in which a matching automaton is created (which can be done in advance and requires only linear time), and a second phase in which the automaton is used to find all matches (which requires only linear time, plus time proportional to the total number of matches). The algorithm can also be adapted to support wildcard searching.
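As a rough illustration, here is a minimal Python sketch built on the pyahocorasick package; the stem-plus-word-boundary treatment of the wildcard entries below is my own simplification, not something the algorithm itself prescribes.

```python
# Minimal sketch using the pyahocorasick package (pip install pyahocorasick).
# Wildcard entries are indexed by their stem; non-wildcard entries must also
# end on a word boundary.
import ahocorasick

lexicon = ["good", "bad", "freed*", "careless*", "great loss"]

automaton = ahocorasick.Automaton()
for entry in lexicon:
    stem = entry.rstrip("*")
    automaton.add_word(stem, (entry.endswith("*"), stem, entry))
automaton.make_automaton()

def find_matches(text):
    text = text.lower()
    for end, (is_wildcard, stem, entry) in automaton.iter(text):
        start = end - len(stem) + 1
        # Matches must begin on a word boundary.
        if start > 0 and text[start - 1].isalnum():
            continue
        # Non-wildcard entries must end on one too.
        if not is_wildcard and end + 1 < len(text) and text[end + 1].isalnum():
            continue
        yield entry, start

sample = "Freed from my contract, no great loss; the new phone is good."
print(list(find_matches(sample)))   # freed*, great loss, good
```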
Hope this helps!
I had a very similar task. Here's how I solved it; the performance is unbelievable: http://www.foibg.com/ijita/vol17/ijita17-2-p03.pdf
This doesn't answer the algorithm question exactly, but check out the re2 library. There are good interfaces in Python, Ruby and assorted other programming languages. In my experience, it's blindingly fast, and removed quite similar bottlenecks in my code with little fuss and virtually no extra code on my part.
The only complexity comes with overlapping patterns. If you want the patterns to start at word boundaries, you should be able to partition the dictionary into a set of regular expressions r_1, r_2, ..., r_k of the form `\b(foobar|baz baz\S*|...)`, where each group in r_{i+1} has a prefix in r_i. You can then short-circuit evaluations, since if r_{i+1} matches then r_i must have matched.

Unless you're implementing your algorithm in highly-optimized C, I'd bet that this approach will be faster than any of the (algorithmically superior) answers elsewhere in this thread.
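As a rough illustration of that layering for the example lexicon (hand-built rather than generated; the standard re module stands in for an re-compatible re2 binding):

```python
# Two-layer sketch: r1 holds cheap prefixes of every entry, r2 the full
# entries with wildcards written as \S*. r2 only runs if r1 already hit.
import re

r1 = re.compile(r"\b(goo|ba|free|care|great)", re.IGNORECASE)
r2 = re.compile(r"\b(good|bad|freed\S*|careless\S*|great loss)\b", re.IGNORECASE)

def find_matches(text):
    if not r1.search(text):      # short circuit: no prefix hit, no match
        return []
    return [m.group(0) for m in r2.finditer(text)]

print(find_matches("no great loss, freedom at last"))
# -> ['great loss', 'freedom']
```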
So long as the patterns are for complete words (you don't want an entry to match inside a longer word such as `storage`; spaces and punctuation are match anchors), then a simple way to go is to translate your lexicon into a scanner generator input (for example you could use flex), generate a scanner, and then run it over your input.

Scanner generators are designed to identify occurrences of tokens in input, where each token type is described by a regular expression. Flex and similar programs create scanners rapidly. Flex handles up to 8k rules (in your case lexicon entries) by default, and this can be expanded. The generated scanners run in linear time and in practice are very fast.
Internally, the token regular expressions are transformed in a standard "Kleene's theorem" pipeline: first to an NFA, then to a DFA. The DFA is then transformed to its unique minimum form. This is encoded in an HLL table, which is emitted inside a wrapper that implements the scanner by referring to the table. This is what flex does, but other strategies are possible. For example, the DFA can be translated to `goto` code where the DFA state is represented implicitly by the instruction pointer as the code runs.

The reason for the spaces-as-anchors caveat is that scanners created by programs like Flex are normally incapable of identifying overlapping matches: the input `strangers` could not match both `strangers` and `range`, for example.

Here is a flex scanner that matches the example lexicon you gave:
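(A minimal sketch: the actions just print each hit, the wildcard entries are written as letter-prefix patterns, and the catch-all word rule relies on flex's longest-match behaviour to keep entries from firing inside longer words.)

```
%{
#include <stdio.h>
%}
%option noyywrap caseless
%%
good                 { printf("match: %s\n", yytext); }
bad                  { printf("match: %s\n", yytext); }
freed[A-Za-z]*       { printf("match: %s\n", yytext); }
careless[A-Za-z]*    { printf("match: %s\n", yytext); }
"great loss"         { printf("match: %s\n", yytext); }
[A-Za-z]+            { /* some other word: ignore */ }
.|\n                 { /* whitespace and punctuation: ignore */ }
%%
int main(void) { yylex(); return 0; }
```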
And to run it:
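(Assuming the scanner is saved as lexicon.l and the text to scan is in tweets.txt; both names are placeholders.)

```sh
flex lexicon.l
cc lex.yy.c -o lexscan
./lexscan < tweets.txt
```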
One answer I wanted to throw out there is the Boyer-Moore search algorithm, which is the algorithm that grep uses; grep is probably one of the fastest search tools available. In addition, you could use something like GNU Parallel to run grep in parallel, really speeding up the search.
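For example, with GNU grep and GNU Parallel (the file names are placeholders, and the wildcard lexicon entries would need converting to grep syntax first, e.g. freed* becoming \bfreed[[:alpha:]]*):

```sh
# Print every lexicon hit, one tweet per line in tweets.txt.
grep -o -f lexicon_patterns.txt tweets.txt

# Split a large input across cores with GNU Parallel.
parallel --pipe grep -o -f lexicon_patterns.txt < tweets.txt
```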
In addition, here is a good article that you might find interesting.
You might still use your original idea of checking every word in the text against the dictionary. However, in order to run it efficiently you need to index the dictionary to make lookups really fast. A trick used in information retrieval systems is to store a so-called permuterm index (http://nlp.stanford.edu/IR-book/html/htmledition/permuterm-indexes-1.html).
Basically what you want to do is to store in a dictionary every possible rotation of each word, with an end-of-word marker appended (e.g. for house):

- house$
- ouse$h
- use$ho
- se$hou
- e$hous
- $house
This index can then be used to quickly check wildcard queries. If, for example, you have `ho*e`, you can look in the permuterm index for a term beginning with `e$ho`, and you will quickly find a match with house.

The search itself is usually done with some logarithmic search strategy (binary search or B-trees) and is thus usually very fast.
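Here is a small Python sketch of the idea, with a sorted list plus binary search (the bisect module) standing in for the B-tree:

```python
# Permuterm sketch: index every rotation of term+"$" in a sorted list, then
# answer a single-'*' wildcard query with a prefix search via binary search.
import bisect

def rotations(term):
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

def build_index(terms):
    index = [(rot, term) for term in terms for rot in rotations(term)]
    index.sort()
    return index

def wildcard_lookup(index, query):
    before, after = query.split("*")
    prefix = after + "$" + before          # "ho*e" -> "e$ho"
    i = bisect.bisect_left(index, (prefix, ""))
    hits = set()
    while i < len(index) and index[i][0].startswith(prefix):
        hits.add(index[i][1])
        i += 1
    return hits

index = build_index(["good", "bad", "freed", "careless", "house", "great loss"])
print(wildcard_lookup(index, "ho*e"))    # {'house'}
print(wildcard_lookup(index, "freed*"))  # {'freed'}
```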