How are Solr filters actually implemented?


Question:

Is my understanding of query processing correct?

  1. Get the DocSet from the cache, or the first filter query creates an OpenBitSet or SortedVIntSet implementation and caches it
  2. Get the DocSet from the cache, or each remaining filter creates its own DocBitSet implementation, which is intersected with the previous result (the efficiency of this code depends on the implementation chosen for the first DocSet)
  3. We do a leapfrog between the main query and the final DocSet (after all intersections) using Lucene's filter+query search (efficiency here again depends on the first DocSet implementation)
  4. We apply post filters (cost >= 100 && cache==false) as an AND on top of the original query, e.g. with fq local params as sketched below
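
For reference, this is roughly the kind of request I have in mind for points 1, 2 and 4 (a minimal SolrJ sketch; the field names and the {!frange} filter are made up, only the cache/cost local-param syntax is Solr's):

```java
import org.apache.solr.client.solrj.SolrQuery;

public class FilterQueryExample {
    public static void main(String[] args) {
        // Main query plus two cheap, cacheable filter queries (hypothetical fields).
        SolrQuery q = new SolrQuery("text:solr");
        q.addFilterQuery("category:books");
        q.addFilterQuery("in_stock:true");
        // Post filter: cache=false together with cost >= 100 asks Solr to apply
        // this filter after the main query, document by document.
        q.addFilterQuery("{!frange l=5 cache=false cost=200}popularity");
        System.out.println(q); // prints the encoded query parameters
    }
}
```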

So, as a consequence, performance depends on the first filter, since for a small result set a SortedIntSet is more efficient and for a big one a BitSet is better. Am I correct?

Second part of the question: DocSet has two main implementations, HashDocSet and SortedIntDocSet. Each intersection implementation iterates over all entries in the first filter and checks whether each is also present in the second DocSet (the kind of loop sketched below)... That means we have to sort filters by size, smallest first. Is it possible to control the order of cached filters (cost only works for non-cached filters)?
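
By "iterates over all entries in the first filter" I mean something like this toy sketch (plain Java, not Solr's actual classes): intersecting two sorted doc-id arrays by walking the smaller one and probing the larger one, which is why starting from the smallest set would do less work.

```java
import java.util.Arrays;

public class SortedIntersection {
    // Toy intersection of two sorted doc-id arrays: walk the smaller array
    // and binary-search the larger one for each candidate.
    static int[] intersect(int[] smaller, int[] larger) {
        int[] result = new int[smaller.length];
        int n = 0;
        for (int doc : smaller) {
            if (Arrays.binarySearch(larger, doc) >= 0) {
                result[n++] = doc;
            }
        }
        return Arrays.copyOf(result, n);
    }

    public static void main(String[] args) {
        int[] a = {2, 7, 11, 42};
        int[] b = {1, 2, 3, 7, 8, 42, 99};
        System.out.println(Arrays.toString(intersect(a, b))); // [2, 7, 42]
    }
}
```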

Answer 1:

That sounds about right. For more information, have a look at SolrIndexSearcher#getProcessedFilter.

So, as a consequence, performance depends on the first filter, since for a small result set a SortedIntSet is more efficient and for a big one a BitSet is better. Am I correct?

This is more a problem of space efficiency than of speed: a sorted int[] costs 4 * nDocs bytes while a bit set costs maxDoc / 8 bytes, which is why Solr uses a sorted int[] whenever the number of documents in the set is less than maxDoc / 32.
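
To make that break-even concrete, a quick back-of-the-envelope calculation (the maxDoc value here is arbitrary):

```java
public class DocSetSizing {
    public static void main(String[] args) {
        long maxDoc = 10_000_000L;        // arbitrary example index size
        long bitSetBytes = maxDoc / 8;    // bit set: 1,250,000 bytes regardless of set size
        long nDocs = maxDoc / 32;         // 312,500 docs: the crossover point
        long sortedIntBytes = 4 * nDocs;  // sorted int[]: 1,250,000 bytes at the crossover
        // Below maxDoc/32 matching docs the sorted int[] is the smaller structure;
        // above it, the bit set is.
        System.out.println("bit set bytes:      " + bitSetBytes);
        System.out.println("sorted int[] bytes: " + sortedIntBytes);
    }
}
```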

Second part of the question: DocSet has two main implementations, HashDocSet and SortedIntDocSet

The problem with SortedIntDocSet is that it doesn't support random access, and the problem with HashDocSet is that it can't enumerate doc IDs in order, which can be important for scoring. This is why Solr uses SortedIntDocSets almost everywhere and creates a transient HashDocSet whenever it needs random access (look at JoinQParserPlugin or DocSlice#intersect for example).
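
A toy illustration of that trade-off using plain Java collections (not Solr's classes): the sorted array naturally enumerates doc IDs in increasing order but needs a binary search per random lookup, while the hash set answers membership checks in constant time but loses the ordering.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class DocSetTradeoff {
    public static void main(String[] args) {
        int[] sortedDocs = {3, 17, 42, 256, 1024};  // SortedIntDocSet-style storage
        Set<Integer> hashed = new HashSet<>();      // HashDocSet-style storage
        for (int d : sortedDocs) hashed.add(d);

        // Ordered enumeration (what scoring/collection wants) is natural here...
        for (int d : sortedDocs) System.out.print(d + " ");
        System.out.println();

        // ...but a random-access membership check costs a binary search, O(log n).
        System.out.println(Arrays.binarySearch(sortedDocs, 42) >= 0);

        // The hash set gives O(1) membership checks but no doc-id order.
        System.out.println(hashed.contains(42));
    }
}
```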