Optimizing Solr for Sorting

Published 2020-05-24 06:07

Question:

I'm using Solr as a realtime search index over a dataset of about 60M large documents. Instead of sorting by relevance, I need to sort by time, so I currently pass a sort parameter with each query. This works fine for narrow searches, but when a search matches a large number of documents, Solr has to collect all of the matching documents and sort them by time before returning anything. This is slow, and there has to be a better way.

What is the better way?

Answer 1:

I found the answer.

If you want to sort by time rather than relevance, put all of your filters in fq= instead of q=. That way Solr doesn't waste time computing a relevance score for every matching document. It turned out Solr was spending most of its time scoring, not sorting.
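As a sketch of the rewrite this suggests (the field names `category`, `source`, and `timestamp` are made up for illustration):

```text
# Before: everything in q=, so every match gets a relevance score
# that the time sort then throws away
q=category:news AND source:reuters&sort=timestamp desc

# After: match-all q= with the filters moved to fq=, so no scoring
# is done and each filter is cached independently in the filterCache
q=*:*&fq=category:news&fq=source:reuters&sort=timestamp desc
```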

Additionally, you can speed up sorting by pre-warming your sort fields via the newSearcher and firstSearcher event listeners in solrconfig.xml. This ensures that sorts are served from the cache.
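A minimal sketch of those listeners in solrconfig.xml, assuming the sort field is called `timestamp` (substitute your own field name):

```xml
<!-- solrconfig.xml: warm the sort field's cache whenever a new searcher opens -->
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">*:*</str><str name="sort">timestamp desc</str></lst>
  </arr>
</listener>
<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">*:*</str><str name="sort">timestamp desc</str></lst>
  </arr>
</listener>
```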



Answer 2:

Obvious first question: what's the type of your time field? If it's a string, sorting will obviously be very slow. A tdate field is even faster than date.
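For reference, this is roughly what the tdate setup looks like in schema.xml (the field name `timestamp` is an assumption; adjust to your schema):

```xml
<!-- schema.xml: a trie-based date type sorts and range-queries
     much faster than a string or plain date field -->
<fieldType name="tdate" class="solr.TrieDateField"
           precisionStep="6" positionIncrementGap="0"/>
<field name="timestamp" type="tdate" indexed="true" stored="true"/>
```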

Another point: do you have enough memory for Solr? If the machine starts swapping, performance immediately becomes awful.

And a third: if you're on an older Lucene version, dates are stored as plain strings, which are very slow to sort.



Answer 3:

Warning: Wild suggestion, not based on prior experience or known facts. :)

  1. Perform a query with rows=0 and no sorting to get the total number of matches. Disable faceting etc. to improve performance; we only need the total match count.
  2. Based on the match count from Step #1, the distribution of your data, and the count/offset of the results you need, fire another query that sorts by date and also adds a filter on the date, like fq=date:[NOW-xDAYS TO *], where x is the estimated number of days within which we expect to find the required number of matching documents.
  3. If Step #2 returns fewer results than you need, relax the filter a bit and fire another query.

For starters, you can estimate x as follows:

If you are uniformly adding n documents a day to an index of N documents, and a specific query matched d documents in Step #1, then the query matches roughly d*n/N new documents per day, so the top r results should lie within about x = (N*r*1.2)/(d*n) days, where 1.2 is a safety margin. If you have to relax your filter too often in Step #3, slowly increase the 1.2 factor as required.
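The estimate above can be sketched in a few lines; the numbers below are invented for illustration (60M-doc index from the question, the rest assumed):

```python
import math

# N = index size, n = docs added per day, d = matches from Step #1,
# r = rows needed, margin = the 1.2 safety factor from the formula.
def estimate_window_days(N, n, d, r, margin=1.2):
    # The query matches ~d*n/N new docs per day, so r results
    # span about r / (d*n/N) = N*r/(d*n) days.
    return (N * r * margin) / (d * n)

x = estimate_window_days(N=60_000_000, n=10_000, d=50_000, r=100)
print(math.ceil(x))  # round up to whole days for fq=date:[NOW-xDAYS TO *]
```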



Tags: lucene solr