I don't understand what they are, and would really appreciate a simple explanation showing what value they bring to the world without too much implementation detail of how they work.
When you index (that is, process) your source information, you will treat some documents and fields as more important than others.
For example, suppose the task is to spy on your colleagues' emails. A word match in the title field is more important than a word match in the body field. We express this by multiplying the number of matches in the title field by a larger number than the one used for body field matches.
Example Indexable Email Records
So, searching for 'sick', multiplying each title match by 4 and each body match by 2, and ordering by highest score first, the documents are ranked ID 9 first and ID 8 second (see Table 1 below).
Table 1: Matches for the word 'sick' ordered by score (descending)
These numbers, 4 and 2, that we multiply the matches by are the norms.
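As a concrete sketch of how that weighting used to be expressed, here is roughly what index-time field boosting looked like in Lucene before 7.0, while per-field boosts were still supported (Lucene 5/6-era API; the email text is hypothetical, and the boost values are just the 4 and 2 from the example):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class BoostedEmailIndexer {
    public static void main(String[] args) throws Exception {
        Directory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));

        Document email = new Document();

        // Title matches should count more than body matches, so give the
        // title field a boost of 4 and the body field a boost of 2.
        // The boost is folded into the field's norm at index time.
        TextField title = new TextField("title", "Feeling sick today", Field.Store.YES);
        title.setBoost(4.0f);

        TextField body = new TextField("body", "I won't be in the office, I'm sick.", Field.Store.YES);
        body.setBoost(2.0f);

        email.add(title);
        email.add(body);
        writer.addDocument(email);
        writer.close();
    }
}
```

From Lucene 7.0 onward, index-time boosts were removed entirely; the usual substitutes are query-time boosts (e.g. `title:sick^4 OR body:sick^2`) or an explicit weight stored in a doc-values field.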
A norm is part of the calculation of a score. The norm could be calculated however you like, really. The main thing that sets the norm apart is that it is calculated at index time. Generally, the other factors influencing the score are calculated at query time, based on how well the document matches the query. The norm saves on query performance by being stored along with the document instead.

The standard implementation can be found, and is well described, in Lucene's TFIDFSimilarity. There, the norm is the product of the field boost (or the product of all field boosts, if multiple have been set on the field) and the "lengthNorm" (a calculated factor designed to weigh matches on shorter documents more heavily). Neither of these depends on the makeup of the query, so both are good choices to compute and store at index time instead.
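As a rough sketch of that product (modelled on Lucene's classic DefaultSimilarity/ClassicSimilarity; the real lengthNorm also discounts overlapping tokens, so this is a simplification, not the library code):

```java
public class NormSketch {
    // Simplified restatement of the classic index-time norm:
    // norm = fieldBoost * lengthNorm, with lengthNorm = 1 / sqrt(#terms).
    // Shorter fields get a larger factor, so a match in a short title
    // outweighs the same match in a long body.
    static float norm(float fieldBoost, int numTermsInField) {
        float lengthNorm = (float) (1.0 / Math.sqrt(numTermsInField));
        return fieldBoost * lengthNorm;
    }

    public static void main(String[] args) {
        System.out.println(norm(4.0f, 3));   // 3-term title, boost 4  -> ~2.31
        System.out.println(norm(2.0f, 100)); // 100-term body, boost 2 -> 0.2
    }
}
```

With the example boosts, a three-term title stores a norm of about 2.31 while a hundred-term body stores 0.2, so title matches dominate without any extra work at query time.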
The norms are then stored in a compressed, and highly lossy, single-byte format (with roughly one significant decimal digit of accuracy).
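You can see that lossiness directly with the SmallFloat utility that classic TFIDFSimilarity used for this encoding (method names as in Lucene 4-6; later versions changed this API):

```java
import org.apache.lucene.util.SmallFloat;

public class NormEncodingDemo {
    public static void main(String[] args) {
        float norm = 2.31f;                              // e.g. the title norm from above
        byte encoded = SmallFloat.floatToByte315(norm);  // 3-bit mantissa, 5-bit exponent
        float decoded = SmallFloat.byte315ToFloat(encoded);
        // Prints something close to, but not exactly, 2.31 --
        // only about one significant decimal digit survives the round trip.
        System.out.println(norm + " -> byte " + encoded + " -> " + decoded);
    }
}
```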