How does the Google “Did you mean?” Algorithm work

2018-12-31 22:48发布

I've been developing an internal website for a portfolio management tool. There is a lot of text data, company names etc. I've been really impressed with some search engines ability to very quickly respond to queries with "Did you mean: xxxx".

I need to be able to intelligently take a user query and respond with not only raw search results but also with a "Did you mean?" response when there is a highly likely alternative answer etc

[I'm developing in ASP.NET (VB - don't hold it against me! )]

UPDATE: OK, how can I mimic this without the millions of 'unpaid users'?

  • Generate typos for each 'known' or 'correct' term and perform lookups?
  • Some other more elegant method?

18条回答
还给你的自由
2楼-- · 2018-12-31 23:41

You mean to say spell checker? If it is a spell checker rather than a whole phrase then I've got a link about the spell checking where the algorithm is developed in python. Check this link

Meanwhile, I am also working on project that includes searching databases using text. I guess this would solve your problem

查看更多
旧人旧事旧时光
3楼-- · 2018-12-31 23:44

Simple. They have tons of data. They have statistics for every possible term, based on how often it is queried, and what variations of it usually yield results the users click... so, when they see you typed a frequent misspelling for a search term, they go ahead and propose the more usual answer.

Actually, if the misspelling is in effect the most frequent searched term, the algorythm will take it for the right one.

查看更多
无色无味的生活
4楼-- · 2018-12-31 23:46

Use Levenshtein distance, then create a Metric Tree (or Slim tree) to index words. Then run a 1-Nearest Neighbour query, and you got the result.

查看更多
回忆,回不去的记忆
5楼-- · 2018-12-31 23:46

Apart from the above answers, in case you want to implement something by yourself quickly, here is a suggestion -

Algorithm

You can find the implementation and detailed documentation of this algorithm on GitHub.

  • Create a Priority Queue with a comparator.
  • Create a Ternay Search Tree and insert all english words (from Norvig's post) along with their frequencies.
  • Start traversing the TST and for every word encountered in TST, calculate its Levenshtein Distance(LD) from input_word
  • If LD ≤ 3 then put it in a Priority Queue.
  • At Last extract 10 words from the Priority Queue and display.
查看更多
步步皆殇っ
6楼-- · 2018-12-31 23:49

regarding your question how to mimic the behavior without having tons of data - why not use tons of data collected by google? Download the google sarch results for the misspelled word and search for "Did you mean:" in the HTML.

I guess that's called mashup nowadays :-)

查看更多
浮光初槿花落
7楼-- · 2018-12-31 23:50

This is an old question, and I'm surprised that nobody suggested the OP using Apache Solr.

Apache Solr is a full text search engine that besides many other functionality also provides spellchecking or query suggestions. From the documentation:

By default, the Lucene Spell checkers sort suggestions first by the score from the string distance calculation and second by the frequency (if available) of the suggestion in the index.

查看更多
登录 后发表回答