High-level explanation of Lucene's Similarity class

Published 2019-04-07 02:28

Do you know where I can find a high-level explanation of the algorithm behind Lucene's Similarity class? I would like to understand it without having to decipher all the math and terminology involved in searching and indexing.

3 Answers
你好瞎i
#2 · 2019-04-07 02:40

Lucene's built-in Similarity is a fairly standard "inverse document frequency" scoring algorithm. The Wikipedia article is brief, but covers the basics. The book Lucene in Action breaks down the Lucene formula in more detail; its treatment doesn't mirror the current Lucene formula perfectly, but all of the main concepts are explained.

Primarily, the score varies with the number of times a term occurs in the current document (the term frequency), and inversely with the number of times the term occurs across all documents (the document frequency). The other factors in the formula are secondary, adjusting the score in an attempt to make scores from different queries fairly comparable to each other. A small sketch of those two main factors follows.
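For illustration, here is a minimal, self-contained sketch of tf and idf. The formula shapes follow Lucene's classic TF-IDF similarity (tf as the square root of the raw count, idf as 1 + log(numDocs / (docFreq + 1))); the corpus counts are invented, and real Lucene scoring adds further factors (length norms, boosts) omitted here.

    // Minimal sketch of the two main factors in Lucene-style scoring:
    // term frequency (tf) and inverse document frequency (idf).
    // The counts below are made up for illustration.
    public class TfIdfSketch {

        // Dampened term frequency: repeated occurrences help, but sublinearly.
        static double tf(long termCountInDoc) {
            return Math.sqrt(termCountInDoc);
        }

        // Inverse document frequency: terms found in fewer documents weigh more.
        static double idf(long docsContainingTerm, long totalDocs) {
            return 1.0 + Math.log((double) totalDocs / (docsContainingTerm + 1));
        }

        public static void main(String[] args) {
            // "lucene": twice in this document, in 5 of 1000 documents overall.
            // "the": ten times in this document, but in 990 of 1000 documents.
            System.out.printf("lucene: %.2f%n", tf(2) * idf(5, 1000));    // ~8.65, rare term dominates
            System.out.printf("the:    %.2f%n", tf(10) * idf(990, 1000)); // ~3.19, common term is discounted
        }
    }

Note how "the" scores lower than "lucene" despite occurring five times as often in the document: the idf factor discounts terms that appear almost everywhere.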

我想做一个坏孩纸
#3 · 2019-04-07 02:45

As erickson mentioned, Lucene uses cosine similarity with term frequency-inverse document frequency (TF-IDF) weighting. Imagine you have two bags of terms, one for the query and one for the document. This measure only matches terms exactly, and then weights the matches by each term's importance in context. Terms that occur very frequently get a smaller weight (importance), because you can find them in many documents. But a serious problem I see is that cosine similarity with TF-IDF is not robust on inconsistent data, where you need to compute the similarity between the query and the document more robustly, e.g. against misspellings, typographical and phonetic errors, because the words must match exactly.
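For reference, this is the standard textbook form of that measurement (the symbols below are the usual tf-idf notation, not Lucene's exact variables): each term t gets a weight per document, and the query q and document d are compared by the cosine of the angle between their weight vectors.

    w_{t,d} = \mathrm{tf}(t,d) \cdot \log \frac{N}{\mathrm{df}(t)}
    \qquad
    \cos(q,d) = \frac{\sum_{t} w_{t,q}\, w_{t,d}}
                     {\sqrt{\sum_{t} w_{t,q}^{2}}\;\sqrt{\sum_{t} w_{t,d}^{2}}}

Here N is the total number of documents and df(t) is the number of documents containing term t. The exact-match limitation above is visible in the formula: a misspelled term t contributes nothing, since its weight in one of the two vectors is zero.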

The star
#4 · 2019-04-07 02:53

Think of each document and search term as a vector whose coordinates represent some measure of how important each word in the entire corpus of documents is to that particular document or search term. Similarity tells you the distance between two different vectors.

Say your corpus is normalized to ignore some terms; then a document consisting only of those terms would be located at the origin of a graph of all your documents in the vector space defined by your corpus. Each document that contains other terms then represents a point in the space whose coordinates are defined by the importance of each such term in the document relative to its importance in the corpus. Two documents (or a document and a search) whose coordinates put their "points" closer together are more similar than those whose coordinates put their "points" further apart. A small sketch of this geometric picture follows.
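To make the geometry concrete, here is a small sketch under invented weights: three axes standing for three hypothetical terms, with cosine similarity as the closeness measure. The class name, terms, and numbers are all made up for illustration.

    // Sketch of the "documents as vectors" picture: each coordinate is
    // an invented tf-idf weight for one term, and cosine similarity
    // measures the angle between vectors, so vectors pointing the same
    // way are "close" regardless of their lengths.
    public class VectorSpaceSketch {

        static double cosine(double[] a, double[] b) {
            double dot = 0, normA = 0, normB = 0;
            for (int i = 0; i < a.length; i++) {
                dot   += a[i] * b[i];
                normA += a[i] * a[i];
                normB += b[i] * b[i];
            }
            return dot / (Math.sqrt(normA) * Math.sqrt(normB));
        }

        public static void main(String[] args) {
            // Axes: invented weights for the terms ["lucene", "search", "cooking"].
            double[] query           = {0.9, 0.4, 0.0};
            double[] docAboutLucene  = {0.8, 0.5, 0.0};
            double[] docAboutCooking = {0.0, 0.1, 0.9};
            System.out.printf("query vs lucene doc:  %.3f%n", cosine(query, docAboutLucene));  // ~0.99, nearly parallel
            System.out.printf("query vs cooking doc: %.3f%n", cosine(query, docAboutCooking)); // ~0.05, nearly orthogonal
        }
    }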

查看更多