What are the limitations of boolean query in Lucene?

Posted 2019-09-05 14:08

Question:

I need to find items in a Lucene index that meet two basic criteria: (1) they match a specific string called a 'relation', and (2) they fall within a list of entitlement 'grant groups'.

An entitlement group defines a subset of items accessible by a member of that group and is much like an authorization role.

All documents in the Lucene index have the 'relation' field and, for simplicity's sake, one or more 'grant-group' fields.

So, for example, a user may search for 'foobar', and that user may be a member of groups a, b, and c. Let's say foobar has grant groups a, p, q, and s.

The query will be, basically, "match 'foobar' AND (a OR b OR c)".
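As a rough illustration, the query above could be written in Lucene's classic query-parser syntax. The sketch below only assembles the query string with the Java standard library and does not call Lucene itself. The field names `relation` and `grant-group` follow the description above; the class and method names are hypothetical, and the quoting assumes both fields are indexed as single, untokenized values.

```java
import java.util.StringJoiner;

// Hypothetical helper (not part of Lucene): builds a query string in
// Lucene's classic query-parser syntax from a relation and a list of groups.
class GrantGroupQuery {

    // Builds: relation:"<relation>" AND (grant-group:"g1" OR grant-group:"g2" ...)
    public static String build(String relation, String... groups) {
        if (groups.length == 0) {
            throw new IllegalArgumentException("at least one grant group is required");
        }
        StringJoiner ors = new StringJoiner(" OR ", "(", ")");
        for (String group : groups) {
            ors.add("grant-group:" + quote(group));
        }
        return "relation:" + quote(relation) + " AND " + ors;
    }

    // Wraps a value in double quotes, escaping backslashes and embedded quotes.
    static String quote(String value) {
        return "\"" + value.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    public static void main(String[] args) {
        System.out.println(build("foobar", "a", "b", "c"));
    }
}
```

With 200 or 300 groups the string simply grows by one `OR` term per group; whether that is fast enough is something only a measurement against the real index can tell.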

According to the Lucene documentation, this should work.

My question is this: how far can you go with the second part of the boolean query, namely the part after 'AND'? I ask because I am about to do a small feasibility study, and one of the requirements is to support potentially many groups in the 'OR' clause, possibly up to 200 or 300.

Would there be noticeable performance degradation?

Thanks.

Answer 1:

From this overview of Lucene performance:

To put it another way: for standard disjunctive (OR'd) queries, the number of clauses doesn't really affect performance, except to the extent that more documents are potential matches.

As Avi mentioned, you will hit a limit at 1024 clauses.



Answer 2:

You should measure, whatever you do. I think you probably should be ok with 200-300 groups. I think the default limit of clauses in a BooleanQuery is 1024, but that can be changed as well.
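One way to stay clear of that limit is to check the group list before building the query. The following is a minimal sketch that assumes one OR clause per distinct grant group and takes the 1024 default quoted above as a given. The class, method, and constant names are hypothetical, and the way Lucene counts clauses across nested queries may differ between versions.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical pre-check (not part of Lucene): removes duplicate groups and
// rejects lists that would need more OR clauses than the configured limit.
class GrantGroupLimit {

    // Assumed default, taken from the limit mentioned in this thread.
    static final int ASSUMED_MAX_CLAUSES = 1024;

    public static String[] prepare(int maxClauses, String... groups) {
        // LinkedHashSet drops duplicates while keeping the original order.
        Set<String> unique = new LinkedHashSet<>(Arrays.asList(groups));
        if (unique.size() > maxClauses) {
            throw new IllegalArgumentException(
                unique.size() + " grant groups exceed the clause limit of " + maxClauses);
        }
        return unique.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] groups = prepare(ASSUMED_MAX_CLAUSES, "a", "b", "a", "c");
        System.out.println(String.join(",", groups));
    }
}
```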

If you use Solr rather than straight Lucene, then I would recommend putting the grant-group part in a filter query (fq), so that it can be cached.



Answer 3:

I'm not sure how many elements you can specify in an OR clause; perhaps you should do a simple proof of concept just to see how it works.

Apart from that, if you use Solr, I would not alter the original query to implement your requirements (it would affect the scoring of matched documents) but would rather use the 'fq' parameter (see Filter Query):
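A request along these lines might look like the sketch below, which keeps the relation match in `q` and moves the grant groups into `fq`. It only assembles the URL with the Java standard library (Java 10 or later for this `URLEncoder.encode` overload). The base URL, the core name `items`, and the class and method names are assumptions for illustration, and the values are assumed to be simple tokens that need no query-syntax escaping. The hyphenated field name comes from the question; a real Solr schema might prefer a name such as `grant_group`.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: builds a Solr select URL with the relation in q
// and the grant groups in a separate filter query (fq).
class SolrFilterQueryUrl {

    public static String build(String baseUrl, String relation, String... groups) {
        // Main query: contributes to scoring.
        String q = "relation:" + relation;
        // Filter query: restricts the result set without affecting scores.
        String fq = "grant-group:(" + String.join(" OR ", groups) + ")";
        return baseUrl + "/select?q=" + encode(q) + "&fq=" + encode(fq);
    }

    // Form-encodes a parameter value (spaces become '+').
    static String encode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The host and core name here are placeholders.
        System.out.println(build("http://localhost:8983/solr/items", "foobar", "a", "b", "c"));
    }
}
```

Because the filter is separate from `q`, the same group list can be reused across different searches by the same user.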