So, I have read multiple sources that try to explain what 'docValues' are in Solr, but I still don't understand when I should use them, especially in relation to indexed vs stored fields. Can anyone please shed some light on it?
Due to the way they are stored and accessed, docValues speed up some operations, such as sorting and faceting.
They are also mandatory for some features, such as streaming expressions and in-place updates.
So, if in doubt:
DocValues can be described as Lucene's column-stride field value storage, or more simply as an uninverted index (a forward index).
To illustrate with json:
row-oriented (stored fields)
column-oriented (docValues)
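As a rough sketch of the two layouts, assume three hypothetical documents (the field names and values here are made up for illustration, and this is plain Python, not how Lucene actually encodes anything). Regrouping the rows by field gives the column-oriented view:

```python
# Hypothetical documents; field names and values are made up for illustration.
rows = [
    {"id": 1, "title": "a", "price": 30},
    {"id": 2, "title": "b", "price": 10},
    {"id": 3, "title": "c", "price": 20},
]


def to_columns(rows):
    """Regroup row-oriented records into one list of values per field."""
    columns = {}
    for row in rows:
        for field, value in row.items():
            columns.setdefault(field, []).append(value)
    return columns


# Row-oriented: everything about one document sits together.
# Column-oriented: everything about one field sits together.
columns = to_columns(rows)
```

Reading `rows[1]` gives one whole document, while reading `columns["price"]` gives one field across all documents, which is the access pattern sorting and faceting need.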
Stored fields store all field values for one document together in a row-stride fashion. In retrieval, all field values are returned at once per document, so that loading the relevant information about a document is very fast.
However, if you need to scan a field (for faceting/sorting/grouping/highlighting) it will be a slow process, as you will have to iterate through all the documents and load each document's fields per iteration resulting in disk seeks.
For example, when sorting, once all the matched documents are found, Lucene needs to get the value of a field for each of them. Similarly, the faceting engine must look up each term that appears in each document that will make up the result set and pull the document IDs in order to build the facet list.
Now this problem can be approached in two ways:
Like the inverted index, docValues are serialized to disk, so we can rely on the OS's file system cache to manage memory instead of retaining structures on the JVM heap.
For all the reasons discussed above. If you are in a low-memory environment, or you don’t need to index a field, DocValues are perfect for faceting/grouping/filtering/sorting/function queries. They also have the potential for increasing the number of fields you can facet/group/filter/sort on without increasing your memory requirements. I've been using docvalues in production Solr for sorting and faceting and have seen a huge improvement in performance of these queries.
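To make the sorting and faceting case concrete, here is a minimal sketch that assumes each field is available as a simple array indexed by internal document number. The column contents and helper names are hypothetical, and real docValues are compressed per-segment structures rather than Python lists:

```python
from collections import Counter

# Hypothetical per-field columns, indexed by internal document number.
price_column = [30, 10, 20, 10]
category_column = ["book", "toy", "book", "game"]


def sort_by_column(matched_docs, column):
    """Order matched document numbers by their value in one column."""
    return sorted(matched_docs, key=lambda doc: column[doc])


def facet_counts(matched_docs, column):
    """Count how often each value occurs among the matched documents."""
    return Counter(column[doc] for doc in matched_docs)


matched = [0, 2, 3]  # document numbers returned by some query
sorted_docs = sort_by_column(matched, price_column)
counts = facet_counts(matched, category_column)
```

Each lookup touches only the one column involved, so no other field of the matched documents has to be loaded.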
The use cases for docValues are already explained by @Persimmonium and are pretty clear: they are good for faceting, sorting, and similar operations in the IR world.
What are docValues, and why are they there? DocValues are simply a way to build a forward index, so that documents point to values. They were built to overcome the limitations of FieldCache by providing a document-to-value mapping that is built at index time. They store values in a column-based fashion, and all the heavy lifting is done during document indexing.
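The difference from FieldCache can be sketched as follows: FieldCache had to derive the document-to-value mapping at search time by uninverting the inverted index, whereas docValues write that mapping when the document is indexed. The toy function below shows the uninverting step for a single-valued field; the index contents and the `uninvert` name are assumptions for illustration only:

```python
# Hypothetical inverted index for one field: term -> document numbers.
inverted = {"blue": [1], "green": [2], "red": [0, 3]}


def uninvert(inverted, num_docs):
    """Build a doc -> value array from a term -> docs mapping.

    Assumes a single-valued field: each document has at most one term.
    """
    forward = [None] * num_docs
    for term, docs in inverted.items():
        for doc in docs:
            forward[doc] = term
    return forward


# This walk over every term is the work that docValues move to index time.
forward = uninvert(inverted, 4)
```

With the forward array in hand, the value for document 3 is a single lookup, `forward[3]`, instead of a search through the term dictionary.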
What docvalues are:
NRT-compatible: These are per-segment data structures built at index time and designed to be efficient for the use case where data is changing rapidly.
Basic query/filter support: You can do basic term, range, etc queries on docvalues fields without also indexing them, but these are constant-score only and typically slower. If you care about performance and scoring, index the field too.
Better compression than fieldcache: Docvalues fields compress better than fieldcache, and "insanity" is impossible.
Able to store data outside of heap memory: You can specify a different docValuesFormat on the fieldType (docValuesFormat="Disk") to only load minimal data on the heap, keeping other data structures on disk.
What docvalues are not:
Not a replacement for stored fields: These are unrelated to stored fields in every way; they are instead data structures for search (sort/facet/group/join/scoring).
Use case to use with Lucene docValues this way.