How to build a 'related questions' engine?

One of our bigger sites has a section where users can send questions to the website owner which get evaluated personally by his staff. When the same question pops up very often they can add this particular question to the Faq.

In order to prevent them from receiving dozens of similar questions a day we would like to provide a feature similar to the 'Related questions' on this site (stack overflow).

What ways are there to build this kind of feature? I know that i should somehow evaluate the question and compare it to the questions in the faq but how does this comparison work? Are keywords extracted and if so how?

Might be worth mentioning this site is built on the LAMP stack thus these are the technologies available.

Thanks!

标签： php mysql lamp recommendation-engine

5条回答

老娘就宠你

2楼-- · 2019-01-18 14:54

If you wanted to build something like this yourself from scratch, you'd use something called TF/IDF: Term Frequency / Inverse document frequency. That means, to simplify it enormously, you find words in the query that are uncommon in the corpus as a whole and find documents that have those words.

In other words, if someone enters a query with the words "I want to buy an elephant" in it, then of the words in the query, the word "elephant" is probably the least common word in your corpus. "Buy" is probably next. So you rank documents (in your case, previous queries) by how much they contain the word "elephant" and then how much they contain the word "buy". The words "I", "to" and "an" are probably in a stop-list, so you ignore them altogether. You rank each document (previous query, in your case) by how many matching words there are (weighting according to inverse document frequency -- i.e. high weight for uncommon words) and show the top few.

I've oversimplified, and you'd need to read up on this to get it right, but it's really not terribly complicated to implement in a simple way. The Wikipedia page might be a good place to start:

http://en.wikipedia.org/wiki/Tf%E2%80%93idf

0人赞添加讨论(0) 举报

Fickle 薄情

3楼-- · 2019-01-18 15:06

I don't know how Stack Overflow works, but I guess that it uses the tags to find related questions. For example, on this question the top few related questions all have the tag recommendation-engine. I would guess that the matches on rarer tags count for more than matches on common tags.

You might also want to look at term frequency–inverse document frequency.

0人赞添加讨论(0) 举报

Deceive 欺骗

4楼-- · 2019-01-18 15:07

Given you're working in a LAMP stack, then you should be able to make good use of MySQL's Fulltext search functions. Which I believe work on the TF-IDF principals, and should make it pretty easy to create the 'related questions' that you want.

0人赞添加讨论(0) 举报

SAY GOODBYE

5楼-- · 2019-01-18 15:17

You can use spell-checking, where the corpus is the titles/text of the existing FAQ entries:

How do you implement a "Did you mean"?

0人赞添加讨论(0) 举报

▲ chillily

6楼-- · 2019-01-18 15:18

There's a great O'Reilly book - Programming Collective Intelligence - which covers group discovery, recommendations and other similar topics. From memory the examples are in Perl, but I found it easy to understand coming from a PHP background and within a few hours had built something akin to what you're after.

Yahoo has a keyword extractor webservice at http://developer.yahoo.com/search/content/V1/termExtraction.html

0人赞添加讨论(0) 举报

How to build a 'related questions' engine?

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间