Lucene Fields vs. DocValues

Posted 2019-02-09 14:43

Question:

I'm using and playing with Lucene to index our data and I've come across some strange behaviors concerning DocValues Fields.

So, could anyone please explain the difference between a regular Document field (like StringField, TextField, IntField, etc.) and DocValues fields (like IntDocValuesField, SortedDocValuesField, etc.; the types seem to have changed in Lucene 5.0)?

First, why can't I access DocValues using document.get(fieldname)? And if I can't, how can I access them?

Second, I've seen that some features have changed in Lucene 5.0; for example, sorting can only be done on DocValues... why is that?

Third, DocValues can be updated but regular fields cannot (you have to delete and add the whole document)...

Also, and perhaps most important, when should I use DocValues and when regular fields?

Joseph

Answer 1:

Most of these questions are quickly answered by referring to the Solr Wiki or by a web search, but here is the gist of DocValues: they're useful for all the other stuff associated with a modern search service, except for the actual searching. From the Solr Community Wiki:

DocValues are a way of recording field values internally that is more efficient for some purposes, such as sorting and faceting, than traditional indexing.

...

DocValue fields are now column-oriented fields with a document-to-value mapping built at index time. This approach promises to relieve some of the memory requirements of the fieldCache and make lookups for faceting, sorting, and grouping much faster.

This should also answer why Lucene 5 requires DocValues for sorting - it's a lot more efficient than the previous approach.

The reason for this is that the storage format is turned around from the standard inverted index. The inverted index maps each term to the documents that contain it, so to gather per-document values for these operations the application previously had to un-invert the index into the in-memory fieldCache at search time. With DocValues the document-to-value mapping is built at index time, so the value for a given document can be looked up directly. This is very useful when you already have a list of matching documents and need their values for sorting, faceting, or grouping.
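To make the two layouts concrete, here is a minimal toy sketch in plain Java. It does not use the Lucene API at all: the class `DocValuesSketch` and its methods are hypothetical names invented for this illustration, and a plain `long[]` stands in for a numeric DocValues column. It only shows why sorting a list of hits is cheap with a document-to-value column and awkward with a value-to-documents index.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model only: hypothetical names, not the Lucene API.
class DocValuesSketch {

    // Inverted layout for one field: value -> ids of the documents holding it.
    // This direction is what makes searching fast.
    static Map<Long, List<Integer>> buildInverted(long[] valueByDoc) {
        Map<Long, List<Integer>> inverted = new TreeMap<>();
        for (int docId = 0; docId < valueByDoc.length; docId++) {
            inverted.computeIfAbsent(valueByDoc[docId], k -> new ArrayList<>()).add(docId);
        }
        return inverted;
    }

    // With only the inverted layout, finding one document's value means
    // scanning the posting lists until the document turns up.
    static long valueFromInverted(Map<Long, List<Integer>> inverted, int docId) {
        for (Map.Entry<Long, List<Integer>> entry : inverted.entrySet()) {
            if (entry.getValue().contains(docId)) {
                return entry.getKey();
            }
        }
        throw new IllegalArgumentException("no value for doc " + docId);
    }

    // Column layout: the array index is the document id, so each value
    // needed for sorting is a single array read.
    static List<Integer> sortHits(List<Integer> hits, long[] column) {
        List<Integer> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparingLong(docId -> column[docId]));
        return sorted;
    }

    public static void main(String[] args) {
        long[] price = {30L, 10L, 20L, 10L};   // value of doc 0, 1, 2, 3
        List<Integer> hits = List.of(0, 1, 3); // documents matched by some query
        System.out.println(sortHits(hits, price));
        System.out.println(valueFromInverted(buildInverted(price), 2));
    }
}
```

In this toy model the inverted map answers "which documents have this value?", while the column answers "which value does this document have?"; sorting, faceting, and grouping ask the second question once per hit.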

If I remember correctly, updating a DocValues-based field involves yanking the document out of the previous token list and re-inserting it in the new location, whereas with the previous approach an update would change loads of dependencies (so reindexing was the only viable strategy).

Use DocValues for fields that need any of the properties mentioned above, such as sorting / faceting / etc.



Tags: solr lucene