LUCENE: search for terms that match a regex

Posted 2019-08-12 04:09

Question:

I need to search for any terms in the Lucene index that match a particular regex. I know that I can do it using the TermsComponent in Solr, if it is configured like this:

<searchComponent name="terms" class="solr.TermsComponent"/>

<!-- A request handler for demonstrating the terms component -->
<requestHandler name="/terms" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <bool name="terms">true</bool>
    <bool name="distrib">false</bool>
  </lst>
  <arr name="components">
    <str>terms</str>
  </arr>
</requestHandler>

For example, I want to fetch any terms containing "surface defects". Using Solr, I can do this:

http://localhost:8983/solr/core1/terms?terms.fl=content&
         terms.regex=^(.*?(\bsurface%20defects\b)[^$]*)$&
         terms.sort=count&
         terms.limit=10000
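
The request above contains characters (spaces, ^, \, [ and so on) that have to be percent-encoded before they can be sent. Here is a minimal sketch of building the same request in Java, assuming Java 10+ for URLEncoder.encode(String, Charset). The class and method names (TermsUrlBuilder, buildTermsUrl) are hypothetical and only for illustration; the host, core, and field are the ones from the question.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class TermsUrlBuilder {
    // Hypothetical helper: assembles a /terms request and percent-encodes
    // the field name and the regex so the URL is safe to send.
    static String buildTermsUrl(String coreUrl, String field, String regex, int limit) {
        return coreUrl + "/terms"
                + "?terms.fl=" + URLEncoder.encode(field, StandardCharsets.UTF_8)
                + "&terms.regex=" + URLEncoder.encode(regex, StandardCharsets.UTF_8)
                + "&terms.sort=count"
                + "&terms.limit=" + limit;
    }

    public static void main(String[] args) {
        // The regex from the question, written as a Java string literal.
        String regex = "^(.*?(\\bsurface defects\\b)[^$]*)$";
        System.out.println(
                buildTermsUrl("http://localhost:8983/solr/core1", "content", regex, 10000));
    }
}
```

URLEncoder uses form encoding, so the space in the regex is sent as + rather than %20; both decode to a space in a query string.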

My question is: how can I achieve the same thing using the Lucene API directly, without Solr? I looked into the org.apache.solr.handler.component.TermsComponent class, but how it works is not obvious to me.

Answer 1:

You can use a RegexpQuery:

// org.apache.lucene.search.RegexpQuery; myRegex must use Lucene's own regex syntax
Query query = new RegexpQuery(new Term("myField", myRegex));

Or the QueryParser:

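// The classic QueryParser (Lucene 4.0+) treats text between forward slashes as a regex
String queryString = "/" + myRegex + "/";
QueryParser parser = new QueryParser("myField", new KeywordAnalyzer());
// parse() throws ParseException if the query string is malformed
Query query = parser.parse(queryString);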

Now, my question is: Are you sure that regex works in Solr?

I haven't tried the TermsComponent regex functionality, so maybe it's doing some fancy SpanQuery footwork, or running regexes against the retrieved stored fields, or something like that. However, you are using regex syntax that Lucene does not support, and you may be making some general assumptions about how regexes work in Lucene that are not accurate.

  • The big one: a Lucene regex query must match the whole term. If your field is not analyzed, the general idea here should work. If it is analyzed with, say, StandardAnalyzer, you cannot use a regex query to search like this, since "surface defects" would be split into multiple terms. On the plus side, in that case a simple PhraseQuery would probably work just fine, as well as being faster and easier. (In general, on Lucene regex queries: you probably don't need them, and if you do, you probably should have analyzed better.)

  • ^ and $ won't work. A regex query always has to match the whole term, so anchors serve no purpose and aren't supported.

  • .*? is not really wrong, but reluctant matching isn't supported, so the ? is redundant. .* does the same thing here.

  • [^$]*: if you are trying not to match dollar signs, fine. Otherwise, I'm not sure what regex engine would support this. $ in a character class is just a dollar sign.

  • \b is not supported in Lucene regexes. The whole idea of analysis is that the content should already be split on word breaks, so what purpose would this serve?
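
To make the whole-term point concrete, here is a rough simulation in plain Java. It does not use Lucene at all: java.util.regex stands in for Lucene's regex engine (the two have different syntax), and Matcher.matches() is used because it requires the entire input to match, which mirrors the whole-term rule described above. The class name, method name, and sample terms are made up for illustration.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class WholeTermMatchDemo {
    // Keeps only the terms that the regex matches in full, the way a
    // term-level regex query only accepts whole-term matches.
    static List<String> matchingTerms(List<String> terms, String regex) {
        Pattern pattern = Pattern.compile(regex);
        return terms.stream()
                .filter(term -> pattern.matcher(term).matches())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String regex = ".*surface defects.*";

        // Un-analyzed field: the whole value is indexed as a single term.
        List<String> unanalyzed = List.of("minor surface defects found");
        // Analyzed field (assumed tokenization): one term per word.
        List<String> analyzed = List.of("minor", "surface", "defects", "found");

        System.out.println(matchingTerms(unanalyzed, regex)); // [minor surface defects found]
        System.out.println(matchingTerms(analyzed, regex));   // []
    }
}
```

No single token of the analyzed field contains the phrase, so nothing matches there. That is the case where a PhraseQuery is the better tool.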



Tags: solr lucene