We have a Solr instance with 86,315,770 documents. It's using up to 4GB of memory and we need it for faceting on a tokenized field called content. The index size on disk is 23GB.
Why are we faceting on a tokenized field? Because we want to query for the top "n" most used terms on that field. The problem is that such queries are taking way too long. Is there any way to improve times when faceting like this? Any recommendations?
Thanks in advance.
Since Solr computes facets on in-memory data structures, facet computation is likely to be CPU-bound. The code to compute facets is already highly optimised (the getCounts method in UnInvertedField for a multi-valued field).
One idea would be to parallelize the computation. Maybe the easiest way to do this would be to split your collection into several shards, as described in the question "Do multiple Solr shards on a single machine improve performance?".
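To make the sharding idea concrete, here is a minimal sketch of how a faceting request over several shards could be built, assuming Solr's standard distributed-search shards parameter. The host names, core names and the helper build_facet_url are made up for illustration; adjust them to your setup.

```python
from urllib.parse import urlencode

def build_facet_url(base_url, field, top_n, shards):
    """Build a facet-only request URL that is fanned out over several shards.

    base_url and the shard addresses are hypothetical placeholders.
    """
    params = {
        "q": "*:*",                   # match every document
        "rows": 0,                    # only facet counts are wanted, no documents
        "facet": "true",
        "facet.field": field,
        "facet.limit": top_n,         # keep only the top n terms
        "shards": ",".join(shards),   # each shard computes its counts separately
        "wt": "json",
    }
    return base_url + "/select?" + urlencode(params)

url = build_facet_url(
    "http://localhost:8983/solr/shard1",
    "content",
    10,
    ["localhost:8983/solr/shard1", "localhost:8983/solr/shard2"],
)
print(url)
```

Each shard computes its own counts and the node receiving the request merges them, so the per-shard work can run on separate cores. Whether this helps in practice depends on how many CPU cores are free.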
Otherwise, if your term dictionary is small enough and if queries can take a limited number of forms, you could set up a different system that would maintain the count matrix for every (term, query) pair. For example, if you only allow term queries, this means you should maintain the counts for every pair of terms. Beware that this could require a lot of disk space, depending on the total number of terms and queries. If you don't require the counts to be exact, the easiest option may be to compute these counts in a batch process. Otherwise, it might be possible, but a little tricky, to keep the counts in sync with Solr.
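As a rough illustration of the batch approach for plain term queries, the sketch below counts, for every ordered pair of terms, how many documents contain both. It assumes the documents are already tokenized and fit in memory, which will not hold for 86 million documents; the function names and the toy data are hypothetical.

```python
from collections import Counter

def cooccurrence_counts(docs):
    """Count, for each (query_term, term) pair, the documents containing both."""
    counts = Counter()
    for tokens in docs:
        terms = set(tokens)  # each document counts once per pair
        for query_term in terms:
            for term in terms:
                counts[(query_term, term)] += 1
    return counts

def top_terms(counts, query_term, n):
    """Return the n most frequent terms among documents matching query_term."""
    matches = [(t, c) for (q, t), c in counts.items() if q == query_term]
    # Sort by descending count, then alphabetically to break ties.
    matches.sort(key=lambda tc: (-tc[1], tc[0]))
    return matches[:n]

docs = [
    ["solr", "facet", "index"],
    ["solr", "facet"],
    ["solr", "index", "shard"],
]
counts = cooccurrence_counts(docs)
print(top_terms(counts, "solr", 2))
```

The table grows with the square of the number of distinct terms per document, which is why this only makes sense for a small term dictionary.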
You could use the topTerms feature of LukeRequestHandler.
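For example, a request along these lines should return the top terms for the content field. This is a sketch that assumes the handler is registered at /admin/luke and accepts the fl and numTerms parameters; the core URL is a placeholder. As far as I know, these top terms are computed over the whole index, not over the results of a query. The pair_up helper assumes the JSON response lists topTerms as a flat alternating list of terms and counts, so check your actual response before relying on it.

```python
from urllib.parse import urlencode

def build_luke_url(core_url, field, num_terms):
    """Build a LukeRequestHandler URL asking for the top terms of one field."""
    params = {"fl": field, "numTerms": num_terms, "wt": "json"}
    return core_url + "/admin/luke?" + urlencode(params)

def pair_up(flat):
    """Turn [term1, count1, term2, count2, ...] into [(term1, count1), ...].

    Assumes a flat alternating layout, which may differ in your response.
    """
    return list(zip(flat[0::2], flat[1::2]))

# Hypothetical core URL; replace with your own.
luke_url = build_luke_url("http://localhost:8983/solr/mycore", "content", 10)
print(luke_url)
```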