Finding all matches from a large set of Strings in a smaller String

Posted 2020-06-21 10:28

Question:

I have a large set of Words and Phrases (a lexicon or dictionary) which includes wildcards. I need to find all instances of those Words and Phrases within a much smaller String (~150 characters at the moment).

Initially, I wanted to run the operation in reverse; that is, to check whether each word in my smaller string exists within the Lexicon, which could be implemented as a Hash Table. The problem is that some of the values in my Lexicon are not single words and many are wildcards (e.g. substri*).
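
For reference, this is roughly the reverse lookup I had in mind (a sketch with illustrative names only); a plain hash set handles single exact words but silently misses the phrases and wildcard entries:

# Rough sketch of the reverse lookup described above (illustrative only).
lexicon = {"good", "bad", "freed*", "careless*", "great loss"}

def naive_matches(text):
    # One hash lookup per word: fine for "good"/"bad", but it will never
    # find the phrase "great loss" or the wildcard entry "freed*".
    return [word for word in text.lower().split() if word in lexicon]

print(naive_matches("Hands down its the best phone I've ever had"))  # -> []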

I'm thinking of using the Rabin-Karp algorithm but I'm not sure this is the best choice.

What's an efficient algorithm or method to perform this operation?

Sample Data:

The dictionary contains hundreds of words and can potentially expand. These words may end with wildcard characters (asterisks). Here are some random examples:

  • good
  • bad
  • freed*
  • careless*
  • great loss

The text we are analyzing (at this point) consists of short, informal (grammar-wise) English statements. The prime example of text (again, at this point in time) would be a Twitter Tweet. These are limited to roughly 140 characters. For example:

Just got the Google nexus without a contract. Hands down its the best phone 
I've ever had and the only thing that could've followed my N900.

It may be helpful to note that we are performing very simple Sentiment Analysis on this text, but our Sentiment Analysis technique is not my concern. I'm simply migrating an existing solution to a "real-time" processing system and need to perform some optimizations.

Answer 1:

I think this is an excellent use case for the Aho-Corasick string-matching algorithm, which is specifically designed to find all matches of a large set of strings in a single string. It runs in two phases: a first phase in which a matching automaton is created (which can be done in advance and requires only linear time), and a second phase in which the automaton is used to find all matches (which requires time linear in the length of the text, plus time proportional to the total number of matches). The algorithm can also be adapted to support wildcard searching.
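
For what it's worth, here is a minimal sketch of the two phases, assuming the pyahocorasick binding (the ahocorasick module); handling of wildcard entries such as freed* (for example by adding the stem and checking the rest of the word afterwards) is left out:

import ahocorasick  # pyahocorasick binding (assumption); any Aho-Corasick implementation works

# Phase 1: build the automaton once from the exact lexicon entries (linear time).
automaton = ahocorasick.Automaton()
for index, entry in enumerate(["good", "bad", "great loss"]):
    automaton.add_word(entry, (index, entry))
automaton.make_automaton()

# Phase 2: scan each incoming tweet (linear time plus one step per match).
tweet = "hands down its the best phone and no great loss over the n900"
for end, (index, entry) in automaton.iter(tweet):
    start = end - len(entry) + 1
    print("matched %r at %d..%d" % (entry, start, end))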

Hope this helps!



Answer 2:

One answer I wanted to throw out there is the Boyer-Moore search algorithm, which is the algorithm that grep uses. Grep is probably one of the fastest search tools available. In addition, you could use something like GNU Parallel to run grep in parallel, thus really speeding up the search.
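
If you're curious what the core of that looks like, below is a small sketch of the simplified Boyer-Moore-Horspool variant (bad-character shifts only); it isn't grep itself, just an illustration of how the algorithm skips ahead through the text:

# Minimal Boyer-Moore-Horspool sketch (bad-character shifts only), for illustration.
def horspool_search(text, pattern):
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return -1
    # Shift distance when the text character aligned with the pattern's last
    # position mismatches; rightmost occurrences win because later entries override.
    shift = {c: m - i - 1 for i, c in enumerate(pattern[:-1])}
    i = 0
    while i <= n - m:
        if text[i:i + m] == pattern:
            return i
        i += shift.get(text[i + m - 1], m)
    return -1

print(horspool_search("great loss of freedom", "freedom"))  # -> 14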

In addition, here is a good article that you might find interesting.



Answer 3:

You might still use your original idea of checking every word in the text against the dictionary. However, in order to run it efficiently you need to index the dictionary to make lookups really fast. A trick used in information retrieval systems is to store a so-called permuterm index (http://nlp.stanford.edu/IR-book/html/htmledition/permuterm-indexes-1.html).

Basically, what you want to do is store in the index every rotation of each word, terminated with a special end-of-word marker $ (e.g. for house):

house$
ouse$h
use$ho
...
e$hous

This index can then be used to quickly check against wildcard queries. If for example you have ho*e you can look in the permuterm index for a term beginning with e$ho and you would quickly find a match with house.

The search itself is usually done with some logarithmic search strategy (binary search or B-trees) and is thus usually very fast.
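
Here is a small sketch of that idea, using a sorted list plus binary search in place of a B-tree (illustrative names, not a drop-in implementation):

import bisect

# Build a permuterm index: every rotation of word + '$' points back to the word.
def build_permuterm(words):
    index = []
    for w in words:
        t = w + "$"
        for i in range(len(t)):
            index.append((t[i:] + t[:i], w))
    index.sort()
    return index

# Answer a single-wildcard query like "ho*e": rotate it so the '*' ends up last,
# then do a prefix lookup ("e$ho" -> "house").
def wildcard_lookup(index, query):
    prefix, suffix = query.split("*")
    key = suffix + "$" + prefix
    lo = bisect.bisect_left(index, (key, ""))
    hits = set()
    while lo < len(index) and index[lo][0].startswith(key):
        hits.add(index[lo][1])
        lo += 1
    return hits

idx = build_permuterm(["house", "horse", "good", "freed"])
print(wildcard_lookup(idx, "ho*e"))  # -> {'house', 'horse'}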



Answer 4:

So long as the patterns are for complete words (you don't want "or" to match inside "storage"; spaces and punctuation act as match anchors), a simple way to go is to translate your lexicon into scanner generator input (for example, you could use flex), generate a scanner, and then run it over your input.

Scanner generators are designed to identify occurrences of tokens in input, where each token type is described by a regular expression. Flex and similar programs create scanners rapidly. Flex handles up to 8k rules (in your case lexicon entries) by default, and this can be expanded. The generated scanners run in linear time and in practice are very fast.

Internally, the token regular expressions are transformed in a standard "Kleene's theorem" pipeline: first to an NFA, then to a DFA. The DFA is then transformed to its unique minimum form. This is encoded in an HLL table, which is emitted inside a wrapper that implements the scanner by referring to the table. This is what flex does, but other strategies are possible. For example, the DFA can be translated to goto code where the DFA state is represented implicitly by the instruction pointer as the code runs.

The reason for the spaces-as-anchors caveat is that scanners created by programs like flex are normally incapable of identifying overlapping matches: "strangers" could not match both "strangers" and "range", for example.

Here is a flex scanner that matches the example lexicon you gave:

%option noyywrap
%%
"good"                    { return 1; }
"bad"                     { return 2; }
"freed"[[:alpha:]]*       { return 3; }
"careless"[[:alpha:]]*    { return 4; }
"great"[[:space:]]+"loss" { return 5; }
.                         { /* match anything else except newline */ }
"\n"                      { /* match newline */ }
<<EOF>>                   { return -1; }
%%
int main(int argc, char *argv[])
{
  yyin = argc > 1 ? fopen(argv[1], "r") : stdin;
  for (;;) {
    int found = yylex();
    if (found < 0) return 0;
    printf("matched pattern %d with '%s'\n", found, yytext);
  }
}

And to run it:

$ flex -i foo.l
$ gcc lex.yy.c
$ ./a.out
Good men can only lose freedom to bad
matched pattern 1 with 'Good'
matched pattern 3 with 'freedom'
matched pattern 2 with 'bad'
through carelessness or apathy.
matched pattern 4 with 'carelessness'


Answer 5:

This doesn't answer the algorithm question exactly, but check out the re2 library. There are good interfaces in Python, Ruby and assorted other programming languages. In my experience, it's blindingly fast, and it removed quite similar bottlenecks in my code with little fuss and virtually no extra code on my part.

The only complexity comes with overlapping patterns. If you want the patterns to start at word boundaries, you should be able to partition the dictionary into a set of regular expressions r_1, r_2, ..., r_k of the form \b(foobar|baz baz\S*|...) where each group in r_{i+1} has a prefix in r_i. You can then short-circuit evaluations, since if r_{i+1} matches then r_i must have matched.

Unless you're implementing your algorithm in highly-optimized C, I'd bet that this approach will be faster than any of the (algorithmically superior) answers elsewhere in this thread.
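
As a concrete sketch (assuming the pyre2 / google-re2 binding, which exposes an interface modelled on the standard re module), the whole lexicon can be compiled into one alternation and each tweet scanned in a single pass:

import re2  # pyre2 / google-re2 binding (assumption); mirrors the standard 're' module

lexicon = ["good", "bad", "freed*", "careless*", "great loss"]
# These entries contain no regex metacharacters; real code should escape them.
parts = [e[:-1] + r"\w*" if e.endswith("*") else e for e in lexicon]
pattern = re2.compile(r"\b(?:" + "|".join(parts) + r")\b")

tweet = "Good men can only lose freedom to bad luck or carelessness"
print(pattern.findall(tweet.lower()))
# -> ['good', 'freedom', 'bad', 'carelessness']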



Answer 6:

Let me get this straight. You have a large set of queries and one small string and you want to find instances of all those queries in that string.

In that case I suggest you index that small document like crazy so that your search time is as short as possible. Hell, with that document size I would even consider doing small mutations (to match wildcards and so on) and indexing those as well.
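
One way to read that suggestion, as a rough sketch: index every word of the tweet together with all of its prefixes, so plain entries, wildcard stems and (via a substring check) phrases can all be answered from that tiny index. Punctuation handling is omitted here.

# Sketch of indexing the short text itself (illustrative only).
def build_index(text):
    words = text.lower().split()
    prefixes = {w[:i] for w in words for i in range(1, len(w) + 1)}
    return set(words), prefixes, " ".join(words)

def matches(lexicon, text):
    words, prefixes, normalized = build_index(text)
    hits = []
    for entry in lexicon:
        if entry.endswith("*"):      # wildcard: match any word with this stem
            if entry[:-1] in prefixes:
                hits.append(entry)
        elif " " in entry:           # phrase: fall back to a substring check
            if entry in normalized:
                hits.append(entry)
        elif entry in words:         # plain word: hash lookup
            hits.append(entry)
    return hits

print(matches(["good", "bad", "freed*", "great loss"],
              "Good men can only lose freedom to bad"))
# -> ['good', 'bad', 'freed*']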



Answer 7:

I had a very similar task. Here's how I solved it; the performance is unbelievable: http://www.foibg.com/ijita/vol17/ijita17-2-p03.pdf