I don't understand what they are, and would really appreciate a simple explanation showing what value they bring to the world without too much implementation detail of how they work.
When you index (that is, process) your source information, you will treat some documents and fields as more important than others.
For example, suppose the task is to spy on your colleagues' emails. A word match in the title field is more important than a word match in the body field. We express this by multiplying the number of matches in the title field by a larger number than the one used for body field matches.
Example Indexable Email Records
So, searching for 'sick', multiplying each title match by 4 and each body match by 2, and ordering by highest score first, the documents are ranked ID 9 first and ID 8 second (see Table 1 below).
Table 1: Matches for the word 'sick' ordered by score (descending)
These numbers, 4 and 2, that we multiply the matches by are the norms.
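As a concrete sketch of how that weighting used to be expressed, here is roughly what index-time field boosting looked like in Lucene before 7.0, while per-field boosts were still supported (Lucene 5/6-era API; the email text is hypothetical, and the boost values are just the 4 and 2 from the example):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class BoostedEmailIndexer {
    public static void main(String[] args) throws Exception {
        Directory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));

        Document email = new Document();

        // Title matches should count more than body matches, so give the
        // title field a boost of 4 and the body field a boost of 2.
        // The boost is folded into the field's norm at index time.
        TextField title = new TextField("title", "Feeling sick today", Field.Store.YES);
        title.setBoost(4.0f);

        TextField body = new TextField("body", "I won't be in the office, I'm sick.", Field.Store.YES);
        body.setBoost(2.0f);

        email.add(title);
        email.add(body);
        writer.addDocument(email);
        writer.close();
    }
}
```

From Lucene 7.0 onward, index-time boosts were removed entirely; the usual substitutes are query-time boosts (e.g. `title:sick^4 OR body:sick^2`) or an explicit weight stored in a doc-values field.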
A norm is part of the calculation of a score. The norm could be calculated however you like, really. The main thing that sets the norm apart is that it is calculated at index time. Generally, the other factors influencing the score are calculated at query time, based on how well the document matches the query. The norm saves on query performance by being stored along with the document instead.

The standard implementation can be found, and is well described, in Lucene's TFIDFSimilarity. There, the norm is the product of the field boost (or the product of all field boosts, if multiple have been set on the field) and the "lengthNorm" (a calculated factor designed to weigh matches on shorter documents more heavily). Neither of these depends on the makeup of the query, so both are good choices to compute and store at index time instead.
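As a rough sketch of that product (modelled on Lucene's classic DefaultSimilarity/ClassicSimilarity; the real lengthNorm also discounts overlapping tokens, so this is a simplification, not the library code):

```java
public class NormSketch {
    // Simplified restatement of the classic index-time norm:
    // norm = fieldBoost * lengthNorm, with lengthNorm = 1 / sqrt(#terms).
    // Shorter fields get a larger factor, so a match in a short title
    // outweighs the same match in a long body.
    static float norm(float fieldBoost, int numTermsInField) {
        float lengthNorm = (float) (1.0 / Math.sqrt(numTermsInField));
        return fieldBoost * lengthNorm;
    }

    public static void main(String[] args) {
        System.out.println(norm(4.0f, 3));   // 3-term title, boost 4  -> ~2.31
        System.out.println(norm(2.0f, 100)); // 100-term body, boost 2 -> 0.2
    }
}
```

With the example boosts, a three-term title stores a norm of about 2.31 while a hundred-term body stores 0.2, so title matches dominate without any extra work at query time.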
The norms are then stored in a compressed, and highly lossy, single-byte format (with roughly one significant decimal digit of accuracy).
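You can see that lossiness directly with the SmallFloat utility that classic TFIDFSimilarity used for this encoding (method names as in Lucene 4-6; later versions changed this API):

```java
import org.apache.lucene.util.SmallFloat;

public class NormEncodingDemo {
    public static void main(String[] args) {
        float norm = 2.31f;                              // e.g. the title norm from above
        byte encoded = SmallFloat.floatToByte315(norm);  // 3-bit mantissa, 5-bit exponent
        float decoded = SmallFloat.byte315ToFloat(encoded);
        // Prints something close to, but not exactly, 2.31 --
        // only about one significant decimal digit survives the round trip.
        System.out.println(norm + " -> byte " + encoded + " -> " + decoded);
    }
}
```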