Search ranking/relevance algorithms

Posted 2019-02-11 15:18

Question:

When developing a database of articles in a Knowledge Base (for example), what are the best ways to sort and display the most relevant answers to a user's question?

Would you use additional data, such as keyword weighting based on whether previous users found the article helpful, or do you find a simple keyword matching algorithm to be sufficient?

Answer 1:

Perhaps the easiest and most naive approach that will give immediately useful results is to implement tf-idf:

Variations of the tf–idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query. tf–idf can be successfully used for stop-words filtering in various subject fields including text summarization and classification.
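To make the idea concrete, here is a minimal sketch of tf-idf ranking in plain Python. It is only an illustration under stated assumptions, not a reference implementation: the helper names (`tokenize`, `build_index`, `rank`), the smoothed idf formula, and the three sample articles are all made up for this example, and a real system would more likely lean on an existing search library.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Return per-document term counts and an idf value for every term."""
    term_counts = [Counter(tokenize(doc)) for doc in docs]
    n_docs = len(docs)
    doc_freq = Counter()
    for counts in term_counts:
        doc_freq.update(counts.keys())  # each document counts once per term
    # Assumption: a smoothed idf variant, so common terms still score above zero.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return term_counts, idf


def rank(query, docs):
    """Return (doc_index, score) pairs sorted by descending tf-idf score."""
    term_counts, idf = build_index(docs)
    query_terms = tokenize(query)
    scores = []
    for i, counts in enumerate(term_counts):
        length = sum(counts.values()) or 1  # avoid dividing by zero on empty docs
        score = sum((counts[t] / length) * idf.get(t, 0.0) for t in query_terms)
        scores.append((i, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical knowledge-base articles, used only for illustration.
articles = [
    "How to reset your password",
    "Changing your email address",
    "Password rules and account security",
]
ranking = rank("reset password", articles)
```

With these sample articles, the first one matches both query terms and ranks first, the third matches only "password", and the second matches nothing. Dividing term counts by document length is one of several common normalizations; cosine normalization is another reasonable choice.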

In a recent related question of mine here I learned of an excellent free book on this topic which you can download or read online:

An Introduction to Information Retrieval



Answer 2:

That's a hard question, and companies like Google are putting a lot of effort into addressing it. Have a look at Google Enterprise Search Appliance or Exalead Enterprise Search.

Then, as a personal opinion, I don't think that any "naive" approach is going to improve the results much compared to a plain keyword search ordered by the number of views on the documents.

If you can expose your knowledge base to the web, then just do it, and let your favorite search engine handle the search for you.



Answer 3:

A little more detail about your exact problem would be good. There are a lot of different techniques that you can use, and many of them are driven by other pieces of data. You can of course use Lucene and build your own indexes; there are Lucene bindings for many languages. Moving up, there is also the Solr project, which is Lucene with a lot of tools and extra functionality around it. That may be more along the lines of what you are looking for.

Intent is tricky, and most modern search engines rely on statistical intent to aid in the ordering of results. You can always have an "Is this article useful?" button and store the query text that leads to useful documents. You could then add a layer of information to the index to boost specific words or phrases and help them point to certain documents.
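The "useful" button idea could be sketched roughly as follows. Everything here is hypothetical: the `FeedbackBoost` class, its `weight` parameter, the multiplicative boost formula, and the `kb-1`/`kb-2` document ids are assumptions for illustration. The base scores are assumed to come from whatever index you already have (Lucene, Solr, or something simpler).

```python
from collections import defaultdict


class FeedbackBoost:
    """Stores 'this article was useful' votes per (query term, document)."""

    def __init__(self, weight=0.1):
        self.weight = weight           # assumed boost per helpful vote
        self.votes = defaultdict(int)  # (term, doc_id) -> number of helpful votes

    def record_helpful(self, query, doc_id):
        """Call this when a user marks doc_id as useful after searching for query."""
        for term in set(query.lower().split()):
            self.votes[(term, doc_id)] += 1

    def rerank(self, query, base_scores):
        """Return doc ids ordered by base score multiplied by a vote-based boost."""
        terms = set(query.lower().split())
        boosted = {}
        for doc_id, score in base_scores.items():
            votes = sum(self.votes.get((t, doc_id), 0) for t in terms)
            boosted[doc_id] = score * (1.0 + self.weight * votes)
        return sorted(boosted, key=boosted.get, reverse=True)


# Hypothetical base scores from an existing keyword index.
feedback = FeedbackBoost(weight=0.5)
base = {"kb-1": 1.0, "kb-2": 0.9}

before = feedback.rerank("vpn login", base)
feedback.record_helpful("vpn login fails", "kb-2")
after = feedback.rerank("vpn login", base)
```

In this example a single helpful vote on `kb-2` is enough to move it above `kb-1` for a later query that shares the terms "vpn" and "login". In practice you would probably want to cap or decay the boost so that a few early votes cannot lock in the ordering.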

Some things to think about... How many documents? What is the average length? Are they updated frequently? What do users do with the documents? What does the spread of unique words to documents look like? (More simply: is it easy to match a query to a specific document or documents based on common unique features?)

If it is on the web, you can always make a Google Custom Search Engine that searches only your site, although you may find this to be sub-optimal for a variety of reasons.

You can always start with a simple index and gradually make it more sophisticated by talking with users and capturing data.



Answer 4:

I think the angle here is not the retrieval itself. It's about scoring the relevance of the information retrieved (a more reactive and passive approach), which can later be used to improve the search engine.

I guess you can try:

  1. kNN on tf-idf vectors to retrieve information

  2. Hand-tagging the retrieved results with a relevance score

  3. Then fitting a regression on those scores to predict the score of an unknown search result, and sorting by it.

Just a thought...

The third point is actually based on the Rocchio algorithm. You can see it here
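For reference, the classic Rocchio formulation adjusts the query vector using judged documents rather than regressing a score, so the sketch below illustrates that formulation and not step 3 literally. The function name `rocchio`, the dict-based sparse vectors, and the sample weights are assumptions for this example; the alpha, beta, and gamma defaults are commonly quoted textbook values, not tuned ones.

```python
from collections import defaultdict


def rocchio(query_vec, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query toward relevant documents and away from non-relevant ones.

    All vectors are sparse dicts mapping term -> weight (for example tf-idf).
    """
    updated = defaultdict(float)
    for term, weight in query_vec.items():
        updated[term] += alpha * weight
    # Add the centroid of relevant docs, subtract the centroid of non-relevant docs.
    for vectors, coeff in ((relevant, beta), (non_relevant, -gamma)):
        if not vectors:
            continue
        for vec in vectors:
            for term, weight in vec.items():
                updated[term] += coeff * weight / len(vectors)
    # Terms whose weight ends up zero or negative are dropped.
    return {term: w for term, w in updated.items() if w > 0}


# Hypothetical hand-tagged results for the query "password".
query = {"password": 1.0}
relevant_docs = [{"password": 0.5, "reset": 0.5}]
non_relevant_docs = [{"password": 0.2, "policy": 0.8}]

new_query = rocchio(query, relevant_docs, non_relevant_docs)
```

Here the updated query gains the term "reset" from the relevant document, while "policy" gets a negative weight and is dropped. The updated vector would then be used to re-run the search.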



Answer 5:

Keyword matching is not enough when dealing with questions; you need to understand intent, which, as joannes says, is a very hot topic in search.