A StringToken Parser which gives Google Search sty

Posted 2019-02-02 01:51

I am looking for a method that takes whitespace-separated tokens in a String and returns a suggested word for each.


For example:
Google Search can take "fonetic wrd nterpreterr",
and at the top of the results page it shows "Did you mean: phonetic word interpreter"

A solution in any of the C* languages or Java would be preferred.


Are there any existing open libraries that provide this functionality?

Or is there a way to use a Google API to request a suggested word?

8 answers
劫难
Reply #2 · 2019-02-02 02:09

Since no one has mentioned it yet, I'll give one more phrase to search for: "edit distance". It can be used to find the closest matches, assuming the errors are typos where letters are transposed, missing, or added.
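To make "edit distance" concrete, here is a minimal Python sketch of one common variant (optimal string alignment), chosen because it counts a swap of two adjacent letters as a single edit alongside insertions, deletions, and substitutions. The function name `edit_distance` is my own; it is not taken from any particular library.

```python
def edit_distance(a: str, b: str) -> int:
    """Optimal string alignment distance between a and b.

    Allowed edits, each costing 1: insert, delete, substitute,
    and swap of two adjacent letters.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # dist[i][j] holds the cost of turning a[:i] into b[:j].
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete all i letters
    for j in range(cols):
        dist[0][j] = j  # insert all j letters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "teh" -> "the".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[-1][-1]


print(edit_distance("wrd", "word"))  # 1 (one missing letter)
print(edit_distance("teh", "the"))   # 1 (one transposition)
```

A lower number means a closer match, so the candidate with the smallest distance to the typed token is the natural first guess.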

Usually, though, this is coupled with some sort of relevancy information: either simple popularity (assume that the most commonly used close-enough match is the most likely correct word), or contextual likelihood (which words tend to follow the preceding correct word, or come before the next one). This gets into information retrieval; one way to start is to look at bigrams and trigrams (sequences of words seen together). Google has very extensive, freely available data sets for these.
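As a rough illustration of ranking by popularity and by context, the sketch below counts unigrams and bigrams in a toy corpus and orders equally close candidates by them. The corpus, the counts, and the function name `rank` are all made up for the example; real n-gram data would be far larger.

```python
from collections import Counter

# Toy corpus standing in for real n-gram data (an assumption for illustration).
CORPUS = (
    "the phonetic word list and the phonetic word interpreter "
    "read each word in the ward"
).split()

UNIGRAMS = Counter(CORPUS)                  # how often each word occurs
BIGRAMS = Counter(zip(CORPUS, CORPUS[1:]))  # how often each adjacent pair occurs


def rank(candidates, previous=None):
    """Order candidates by bigram count with the previous word, then popularity."""
    def score(word):
        context = BIGRAMS[(previous, word)] if previous else 0
        return (context, UNIGRAMS[word])
    return sorted(candidates, key=score, reverse=True)


# "wrd" is one edit away from both "word" and "ward".
print(rank(["ward", "word"]))                  # ['word', 'ward']
print(rank(["ward", "word"], previous="the"))  # ['ward', 'word']
```

Without context the more popular word wins; with a preceding word, the pair that was actually seen together is ranked first.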

For a simple initial solution, though, a dictionary coupled with Levenshtein-based matchers works surprisingly well.
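A minimal sketch of that simple approach, assuming a tiny hard-coded word list in place of a real dictionary: each token is replaced by the dictionary word with the smallest Levenshtein distance, unless nothing is close enough. The names `DICTIONARY`, `suggest`, `did_you_mean` and the `max_distance` cutoff are hypothetical choices for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic Levenshtein distance, keeping only two rows of the table."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


# Hypothetical word list; a real one would be loaded from a dictionary file.
DICTIONARY = ["phonetic", "word", "interpreter", "search", "parser", "token"]


def suggest(token: str, max_distance: int = 2) -> str:
    """Return the closest dictionary word, or the token itself if none is close."""
    if token in DICTIONARY:
        return token
    best = min(DICTIONARY, key=lambda word: levenshtein(token, word))
    return best if levenshtein(token, best) <= max_distance else token


def did_you_mean(query: str) -> str:
    """Correct each whitespace-separated token independently."""
    return " ".join(suggest(token) for token in query.split())


print(did_you_mean("fonetic wrd nterpreterr"))  # phonetic word interpreter
```

Scanning the whole list per token is fine for small dictionaries; with a large one, ties between equally close words become common, which is where the popularity information mentioned above comes in.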

Reply #3 · 2019-02-02 02:20

In his article "How to Write a Spelling Corrector", Peter Norvig discusses how a Google-like spellchecker could be implemented. The article contains a 20-line implementation in Python, as well as links to several reimplementations in C, C++, C# and Java. Here is an excerpt:

The full details of an industrial-strength spell corrector like Google's would be more confusing than enlightening, but I figured that on the plane flight home, in less than a page of code, I could write a toy spelling corrector that achieves 80 or 90% accuracy at a processing speed of at least 10 words per second.

Using Norvig's code and this text as the training set, I get the following results:

>>> import spellch
>>> [spellch.correct(w) for w in 'fonetic wrd nterpreterr'.split()]
['phonetic', 'word', 'interpreters']
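For readers who want to see the general shape of such a corrector, here is a compact sketch in the spirit of that approach: generate every string one or two edits away from the input, keep the ones seen in the training text, and return the most frequent. It is my own condensed version, not the article's exact code, and `TRAINING_TEXT` is a tiny made-up stand-in for a real training corpus.

```python
import re
from collections import Counter

# Tiny stand-in training text (an assumption; a real corrector needs far more).
TRAINING_TEXT = (
    "a phonetic word interpreter reads each word "
    "and the interpreter spells the word"
)
WORDS = Counter(re.findall(r"[a-z]+", TRAINING_TEXT.lower()))
LETTERS = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """All strings one delete, transpose, replace, or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in LETTERS]
    inserts = [left + c + right for left, right in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def known(words):
    """Keep only the words that occur in the training text."""
    return {w for w in words if w in WORDS}


def correct(word):
    """Prefer the word itself, then known words 1 edit away, then 2 edits away."""
    candidates = (known([word])
                  or known(edits1(word))
                  or known(e2 for e1 in edits1(word) for e2 in edits1(e1))
                  or {word})
    # Among equally distant candidates, pick the most frequent one.
    return max(candidates, key=lambda w: WORDS[w])


print([correct(w) for w in "fonetic wrd nterpreterr".split()])
```

Which word wins among equally distant candidates depends entirely on the word counts in the training text, so a different corpus can give a different suggestion for the same typo.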