Fast Dynamic Fuzzy search over 100k+ strings in C#

Published 2019-04-07 13:00

Question:

Let's say the strings are pre-loaded stock symbols, and the query is typed into a text box. I am looking for code that I can copy, not a library to install.

This was inspired by this question:

Are there any Fuzzy Search or String Similarity Functions libraries written for C#?

The Levenshtein distance algorithm seems to work well, but it takes time to compute. Are there any optimizations that exploit the fact that the query will be re-run each time the user types another letter? I am interested in showing at most the top 10 matches for each input.
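
For reference, this is the kind of textbook two-row Levenshtein implementation I mean (a minimal sketch, not my exact code); running it against all 100k strings on every keystroke is what gets slow:

    using System;

    static class Levenshtein
    {
        // Classic dynamic-programming edit distance, keeping only two rows.
        public static int Distance(string a, string b)
        {
            var prev = new int[b.Length + 1];
            var curr = new int[b.Length + 1];
            for (int j = 0; j <= b.Length; j++) prev[j] = j;

            for (int i = 1; i <= a.Length; i++)
            {
                curr[0] = i;
                for (int j = 1; j <= b.Length; j++)
                {
                    int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                    curr[j] = Math.Min(Math.Min(curr[j - 1] + 1,  // insertion
                                                prev[j] + 1),     // deletion
                                       prev[j - 1] + cost);       // substitution
                }
                (prev, curr) = (curr, prev);
            }
            return prev[b.Length]; // prev holds the last computed row after the swap
        }
    }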

Answer 1:

You need to determine the matching rules for your strings. What determines a 'similar string'? (A toy scoring sketch follows this list.)

  • number of matching characters
  • number of non-matching characters
  • similar length
  • typos or phonetic errors
  • business specific abbreviations
  • must start with the same substring
  • must end with the same substring
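
A toy illustration of turning such rules into a score (the specific rules and weights here are placeholders I made up, not a recommendation):

    using System;

    static class SymbolScorer
    {
        // Hypothetical composite score: higher means "more similar".
        public static double Score(string query, string candidate)
        {
            query = query.ToUpperInvariant();
            candidate = candidate.ToUpperInvariant();
            double score = 0;

            // "must start with the same substring": reward a shared prefix heavily.
            if (candidate.StartsWith(query)) score += 2.0;

            // "number of matching characters": fraction of query chars present.
            int matched = 0;
            foreach (var c in query)
                if (candidate.IndexOf(c) >= 0) matched++;
            score += (double)matched / Math.Max(1, query.Length);

            // "similar length": penalize large length differences.
            score -= 0.1 * Math.Abs(candidate.Length - query.Length);

            return score;
        }
    }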

I've done quite a lot of work with string-matching algorithms and have yet to find any existing library or code that meets my specific requirements. Review them, borrow ideas from them, but you will invariably have to customize and write your own code.

The Levenshtein algorithm is good but a bit slow. I've had some success with both the Smith-Waterman and Jaro-Winkler algorithms, but the best I found for my purpose was Monge-Elkan (from memory). However, it pays to read the original research to understand why each algorithm was written and what dataset it was designed for.
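
For illustration, here is a self-contained sketch of Jaro-Winkler using the standard constants (0.1 prefix scale, common prefix capped at 4 characters); treat it as a starting point to customize, not production code:

    using System;

    static class JaroWinkler
    {
        // Returns a similarity in [0, 1]; 1 means identical.
        public static double Similarity(string s1, string s2)
        {
            if (s1.Length == 0 && s2.Length == 0) return 1.0;

            // Characters match only if equal and within this sliding window.
            int window = Math.Max(0, Math.Max(s1.Length, s2.Length) / 2 - 1);
            var matched1 = new bool[s1.Length];
            var matched2 = new bool[s2.Length];

            int matches = 0;
            for (int i = 0; i < s1.Length; i++)
            {
                int lo = Math.Max(0, i - window);
                int hi = Math.Min(s2.Length - 1, i + window);
                for (int j = lo; j <= hi; j++)
                {
                    if (matched2[j] || s1[i] != s2[j]) continue;
                    matched1[i] = matched2[j] = true;
                    matches++;
                    break;
                }
            }
            if (matches == 0) return 0.0;

            // Transpositions: matched characters that appear out of order.
            int t = 0, k = 0;
            for (int i = 0; i < s1.Length; i++)
            {
                if (!matched1[i]) continue;
                while (!matched2[k]) k++;
                if (s1[i] != s2[k]) t++;
                k++;
            }
            t /= 2;

            double m = matches;
            double jaro = (m / s1.Length + m / s2.Length + (m - t) / m) / 3.0;

            // Winkler boost: reward a common prefix of up to 4 characters.
            int prefix = 0;
            while (prefix < Math.Min(4, Math.Min(s1.Length, s2.Length))
                   && s1[prefix] == s2[prefix]) prefix++;

            return jaro + prefix * 0.1 * (1.0 - jaro);
        }
    }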

If you don't properly define what you want to match and measure, then you'll find high scores on unexpected matches and low scores on expected matches. String matching is very domain-specific. If you don't properly define your domain, you are like a fisherman without a clue, throwing hooks around and hoping for the best.



Answer 2:

This blog post describes some work that went into Lucene in this area. They were able to implement Levenshtein-distance fuzzy matching quite efficiently, using a finite state transducer (automaton), for up to an edit distance of 2. The code is all in Java and a little complex, though it is open source.

But the basic idea is simple enough: think of your dictionary as a giant tree of letter-states. At state0, you have consumed no letters. At state1, you admit any letter that could be the first letter of a word. State2 is conditioned on state1: if the first letter was 'x', the next state admits only letters that can follow 'x' in position 2. And so on.
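
A minimal sketch of that letter-tree as a plain C# trie (the names here are illustrative; Lucene's real implementation compiles the dictionary into an automaton rather than using node objects):

    using System.Collections.Generic;

    class TrieNode
    {
        // Each outgoing edge admits one letter that can follow this prefix.
        public Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();
        public bool IsWord; // true if the path from the root spells a complete entry
    }

    class Trie
    {
        public TrieNode Root = new TrieNode(); // "state0": no letters consumed

        public void Insert(string word)
        {
            var node = Root;
            foreach (char c in word)
            {
                if (!node.Children.TryGetValue(c, out var next))
                    node.Children[c] = next = new TrieNode();
                node = next; // advance one letter-state
            }
            node.IsWord = true;
        }
    }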

Now, for Levenshtein matching, you traverse the letter-tree while allowing some errors: deletions, insertions, substitutions (one-letter wildcards), and possibly transpositions (a nice extension of Levenshtein treats a transposition as a single edit rather than two). You have to maintain state in order to keep track of how many edits have been used. This can be done very efficiently - certainly fast enough for an interactive "while you type" spelling suggester.
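
One self-contained way to sketch that traversal in C#: carry one row of the Levenshtein table per trie node, and prune any branch whose entire row already exceeds the edit budget. This is the well-known "Levenshtein over a trie" technique rather than Lucene's automaton, and transpositions are omitted for brevity:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    class FuzzyTrie
    {
        class Node
        {
            public Dictionary<char, Node> Children = new Dictionary<char, Node>();
            public string Word; // non-null if a complete entry ends here
        }

        readonly Node _root = new Node();

        public FuzzyTrie(IEnumerable<string> words)
        {
            foreach (var w in words)
            {
                var n = _root;
                foreach (var c in w)
                {
                    if (!n.Children.TryGetValue(c, out var next))
                        n.Children[c] = next = new Node();
                    n = next;
                }
                n.Word = w;
            }
        }

        public List<(string Word, int Distance)> Search(string query, int maxEdits)
        {
            var results = new List<(string, int)>();
            // Row 0: distance from the empty trie prefix to each query prefix.
            var firstRow = Enumerable.Range(0, query.Length + 1).ToArray();
            foreach (var kv in _root.Children)
                Walk(kv.Value, kv.Key, query, firstRow, maxEdits, results);
            return results.OrderBy(r => r.Item2).ToList();
        }

        static void Walk(Node node, char edge, string query, int[] prevRow,
                         int maxEdits, List<(string, int)> results)
        {
            int cols = query.Length + 1;
            var row = new int[cols];
            row[0] = prevRow[0] + 1; // trie letter with no query letter consumed

            for (int i = 1; i < cols; i++)
            {
                int insert = row[i - 1] + 1;
                int delete = prevRow[i] + 1;
                int replace = prevRow[i - 1] + (query[i - 1] == edge ? 0 : 1);
                row[i] = Math.Min(insert, Math.Min(delete, replace));
            }

            if (node.Word != null && row[cols - 1] <= maxEdits)
                results.Add((node.Word, row[cols - 1]));

            // Prune: if every cell exceeds the budget, no descendant can match.
            if (row.Min() <= maxEdits)
                foreach (var kv in node.Children)
                    Walk(kv.Value, kv.Key, query, row, maxEdits, results);
        }
    }

Build the trie once at startup; each keystroke is then a Search(typedText, 2) followed by taking the first 10 results, which should be comfortably fast for interactive use over 100k symbols.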