Possible Duplicate:
How does the Google “Did you mean?” Algorithm work?
Suppose you have a search system already in your website. How can you implement the "Did you mean:<spell_checked_word>
" like Google does in some search queries?
Possible Duplicate:
How does the Google “Did you mean?” Algorithm work?
Suppose you have a search system already in your website. How can you implement the "Did you mean:<spell_checked_word>
" like Google does in some search queries?
I would suggest looking at SOUNDEX to find similar words in your database.
You can also access google own dictionary by using the Google API spelling suggestion request.
Soundex and "Porter stemming" (soundex is trivial, not sure about porter stemming).
Actually what Google does is very much non-trivial and also at first counter-intuitive. They don't do anything like check against a dictionary, but rather they make use of statistics to identify "similar" queries that returned more results than your query, the exact algorithm is of course not known.
There are different sub-problems to solve here, as a fundamental basis for all Natural Language Processing statistics related there is one must have book: Foundation of Statistical Natural Language Processing.
Concretely to solve the problem of word/query similarity I have had good results with using Edit Distance, a mathematical measure of string similarity that works surprisingly well. I used to use Levenshtein but the others may be worth looking into.
Soundex - in my experience - is crap.
Actually efficiently storing and searching a large dictionary of misspelled words and having sub second retrieval is again non-trivial, your best bet is to make use of existing full text indexing and retrieval engines (i.e. not your database's one), of which Lucene is currently one of the best and coincidentally ported to many many platforms.
There's something called aspell that might help: http://blog.evanweaver.com/files/doc/fauna/raspell/classes/Aspell.html
There's a ruby gem for it, but I don't know how to talk to it from python http://blog.evanweaver.com/files/doc/fauna/raspell/files/README.html
Here's a quote from the ruby implementation
This outputs:
Possible correction for haert: heart Possible correction for wil: Will
You may want to look at Peter Norvig's "How to Write a Spelling Corrector" article.
If you have industry specific translations, you will likely need a thesaurus. For example, I worked in the jewelry industry and there were abbreviate in our descriptions such as kt - karat, rd - round, cwt - carat weight... Endeca (the search engine at that job) has a thesaurus that will translate from common misspellings, but it does require manual intervention.