I am looking for state-of-the-art algorithms for approximate string matching. Can you point me to references (articles, theses, ...)? Thank you.
You may have your answer already, but I want to share my perspective on approximate string matching so that others might benefit. I am speaking from experience, having worked on cloud services that had to handle this kind of matching at very large scale.
If we just want to talk about approximate string matching algorithms, there are many. A few of them are: Jaro-Winkler, edit distance (Levenshtein), Jaccard similarity, Soundex/phonetics-based algorithms, etc. A quick search will turn up the details of each.
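To make one of these concrete, here is a minimal sketch in Java (Java because Lucene, discussed below, is a Java library) of Jaccard similarity computed over character bigrams. The class and method names are my own, for illustration only:

```java
import java.util.HashSet;
import java.util.Set;

public class JaccardDemo {
    // Jaccard similarity over character bigrams: |A ∩ B| / |A ∪ B|.
    static double jaccard(String a, String b) {
        Set<String> bigramsA = bigrams(a);
        Set<String> bigramsB = bigrams(b);
        Set<String> union = new HashSet<>(bigramsA);
        union.addAll(bigramsB);
        if (union.isEmpty()) return 1.0; // both strings too short to form bigrams
        Set<String> intersection = new HashSet<>(bigramsA);
        intersection.retainAll(bigramsB);
        return (double) intersection.size() / union.size();
    }

    // Collect all adjacent character pairs of s.
    static Set<String> bigrams(String s) {
        Set<String> result = new HashSet<>();
        for (int i = 0; i + 1 < s.length(); i++) {
            result.add(s.substring(i, i + 2));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(jaccard("jonathan", "jonathon")); // 0.625, fairly similar
        System.out.println(jaccard("jonathan", "smith"));    // 0.0, no shared bigrams
    }
}
```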
The irony is that these work well when you are matching two given input strings. That is fine in theory, and it demonstrates how fuzzy or approximate string matching works.
However, a grossly understated point is how to use them in a production setting. Not everybody I know who was scouting for an approximate string matching algorithm knew how to make it work in a production environment.
Assume we have a list of millions of names and we want to search a given input name against every entry in the list. Using one of the standard algorithms above directly would be a disaster.
A typical edit distance algorithm has a time complexity of O(N^2) for two strings of comparable length N (more precisely, O(N1 * N2) for lengths N1 and N2). Scanning a list of size M then costs O(M * N^2). That implies very high hardware requirements, and it does not work in your favor regardless of how much hardware you stack up.
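For reference, here is the classic dynamic-programming Levenshtein implementation that the complexity figure above refers to; filling the full (N+1) x (N+1) table is what makes it quadratic. A minimal sketch, with illustrative names:

```java
public class Levenshtein {
    // Classic dynamic-programming edit distance. dp[i][j] holds the cost of
    // transforming the first i characters of a into the first j characters of b.
    static int distance(String a, String b) {
        int[][] dp = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) dp[i][0] = i; // i deletions
        for (int j = 0; j <= b.length(); j++) dp[0][j] = j; // j insertions
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                dp[i][j] = Math.min(
                        Math.min(dp[i - 1][j] + 1,   // delete from a
                                 dp[i][j - 1] + 1),  // insert into a
                        dp[i - 1][j - 1] + cost);    // substitute (or match)
            }
        }
        return dp[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting")); // prints 3
    }
}
```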
This is where we have to start thinking about other approaches. One common approach to solving such a problem in production is to use a standard search engine such as Apache Lucene.
https://lucene.apache.org/
The Lucene indexing engine indexes the reference data (as "documents"), and input queries can then be fired against the engine. The results come back ranked by how close they are to the input. This is close to how the Google search engine works: Google crawls and indexes the whole web, and you build a miniature system that mimics what Google does.
This works for most cases, including complicated name matching where the first, middle, and last names are interchanged.
You can select your results based on the scores emitted by Lucene.
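As a rough illustration of this workflow, here is a minimal, self-contained Lucene sketch: it indexes a few names in memory and fires a FuzzyQuery (which matches terms within a bounded edit distance) against them, then prints each hit with its score so you can filter on it. Class names follow the Lucene 8.x Java API; treat this as a sketch under that assumption, not production code:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class FuzzyNameSearch {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory(); // in-memory index for the demo
        try (IndexWriter writer =
                 new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            for (String name : new String[]{"Jonathan Smith", "John Smythe", "Jane Doe"}) {
                Document doc = new Document();
                doc.add(new TextField("name", name, Field.Store.YES));
                writer.addDocument(doc);
            }
        }

        // FuzzyQuery matches indexed terms within at most 2 edits of the query term.
        FuzzyQuery query = new FuzzyQuery(new Term("name", "jonathen"), 2);
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                // Filter on hit.score to keep only sufficiently close matches.
                System.out.println(searcher.doc(hit.doc).get("name")
                        + " score=" + hit.score);
            }
        }
    }
}
```

In a real deployment the index would live on disk (or in a managed cluster) and be built once from the reference list, so each lookup touches only the relevant terms instead of scanning all M entries.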
As you mature in this role, you may start considering hosted solutions like Amazon CloudSearch, which wraps Solr and Elasticsearch for you. Of course it uses Lucene underneath, and it keeps you independent of the potential size of the index as the reference data used for indexing grows.
http://aws.amazon.com/cloudsearch/
You might want to read about Levenshtein distance.
http://en.wikipedia.org/wiki/Levenshtein_distance