How to build a 'related questions' engine?

2019-01-18 14:44发布

问题:

One of our bigger sites has a section where users can send questions to the website owner which get evaluated personally by his staff. When the same question pops up very often they can add this particular question to the Faq.

In order to prevent them from receiving dozens of similar questions a day we would like to provide a feature similar to the 'Related questions' on this site (stack overflow).

What ways are there to build this kind of feature? I know that i should somehow evaluate the question and compare it to the questions in the faq but how does this comparison work? Are keywords extracted and if so how?

Might be worth mentioning this site is built on the LAMP stack thus these are the technologies available.

Thanks!

回答1:

I don't know how Stack Overflow works, but I guess that it uses the tags to find related questions. For example, on this question the top few related questions all have the tag recommendation-engine. I would guess that the matches on rarer tags count for more than matches on common tags.

You might also want to look at term frequency–inverse document frequency.



回答2:

If you wanted to build something like this yourself from scratch, you'd use something called TF/IDF: Term Frequency / Inverse document frequency. That means, to simplify it enormously, you find words in the query that are uncommon in the corpus as a whole and find documents that have those words.

In other words, if someone enters a query with the words "I want to buy an elephant" in it, then of the words in the query, the word "elephant" is probably the least common word in your corpus. "Buy" is probably next. So you rank documents (in your case, previous queries) by how much they contain the word "elephant" and then how much they contain the word "buy". The words "I", "to" and "an" are probably in a stop-list, so you ignore them altogether. You rank each document (previous query, in your case) by how many matching words there are (weighting according to inverse document frequency -- i.e. high weight for uncommon words) and show the top few.

I've oversimplified, and you'd need to read up on this to get it right, but it's really not terribly complicated to implement in a simple way. The Wikipedia page might be a good place to start:

http://en.wikipedia.org/wiki/Tf%E2%80%93idf



回答3:

Given you're working in a LAMP stack, then you should be able to make good use of MySQL's Fulltext search functions. Which I believe work on the TF-IDF principals, and should make it pretty easy to create the 'related questions' that you want.



回答4:

There's a great O'Reilly book - Programming Collective Intelligence - which covers group discovery, recommendations and other similar topics. From memory the examples are in Perl, but I found it easy to understand coming from a PHP background and within a few hours had built something akin to what you're after.

Yahoo has a keyword extractor webservice at http://developer.yahoo.com/search/content/V1/termExtraction.html



回答5:

You can use spell-checking, where the corpus is the titles/text of the existing FAQ entries:

How do you implement a "Did you mean"?