Challenge with hyphens/dashes in Solr Lucene

Posted 2019-02-28 09:14

Question:

I'm trying to get Solr to extract only the second, 7-digit portion of a ticket number formatted like n-nnnnnnn.

Originally I hoped to keep the full ticket together. According to the documentation, hyphenated tokens containing numbers should be kept together, but after hammering away at this problem for some time and looking at the code, I don't think that's the case. Solr always generates two terms. So rather than getting large numbers of matches for the first digit of n-, I'm thinking I can get better query results from just the second portion. Substituting an A for the dash:

    <charFilter class="solr.PatternReplaceCharFilterFactory"
      pattern="\b\d[A](\d\d\d\d\d\d\d)\b" replacement="$1" replace="all" 
      maxBlockChars="20000"/>

will parse 1A1234567 fine. But the same filter with a hyphen in place of the A in the pattern will not parse 1-1234567.

So it looks like the problem is just the hyphen. I've tried \- (escaped), [-], \u002D, \x{45}, and \x045 without success.
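Since PatternReplaceCharFilterFactory takes a Java regular expression, one way to narrow this down is to run the same patterns through plain java.util.regex, outside Solr. The sketch below is only that: the class and method names are made up for illustration, and it says nothing about what Solr does to the text before or after the char filter. It does suggest that the plain and bracketed hyphen forms are fine as regexes, and that \x{45} and \x045 could never have matched, because 45 is the decimal code of the hyphen; the hex code is 2D.

```java
import java.util.regex.Pattern;

// Stand-alone check of the question's patterns with java.util.regex only.
// HyphenRegexCheck and its methods are hypothetical names, not Solr API.
class HyphenRegexCheck {

    // Replaces every match with its first capture group, which is what
    // replacement="$1" with replace="all" asks for in the question.
    static String keepSecondPart(String regex, String input) {
        return Pattern.compile(regex).matcher(input).replaceAll("$1");
    }

    // True if the regex fragment matches exactly one hyphen-minus character.
    static boolean matchesHyphen(String regex) {
        return Pattern.matches(regex, "-");
    }

    public static void main(String[] args) {
        String ticket = "1-1234567";

        // Plain hyphen, bracketed hyphen, and the hex escape x2D all work.
        System.out.println(keepSecondPart("\\b\\d-(\\d{7})\\b", ticket));      // 1234567
        System.out.println(keepSecondPart("\\b\\d[-](\\d{7})\\b", ticket));    // 1234567
        System.out.println(keepSecondPart("\\b\\d\\x2D(\\d{7})\\b", ticket));  // 1234567

        // x{45} is hex 45, the letter E, so it does not match a hyphen.
        System.out.println(matchesHyphen("\\x{45}"));  // false

        // x045 is read as x04 (a control character) followed by a literal 5.
        System.out.println(matchesHyphen("\\x045"));   // false
    }
}
```

If these patterns behave the same way on your JVM, the regex itself is unlikely to be the culprit, and the analysis chain around it is the next place to look.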

I've tried putting char filters around it:

    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
    <charFilter class="solr.PatternReplaceCharFilterFactory"
      pattern="\b\d[-](\d\d\d\d\d\d\d)\b" replacement="$1" replace="all" maxBlockChars="20000"/>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping2.txt"/>

with mappings:

"-" => "z"

and then

"z" => "-"

It looks like the hyphen is eaten up in the Flex tokenization and isn't even available to the char filter.

Has anyone had more success with hyphen/dash in Solr/Lucene? Thanks

Answer 1:

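If your Solr is using a recent Lucene (3.x or later, I think), you will want to use ClassicAnalyzer rather than StandardAnalyzer, because StandardAnalyzer now always treats hyphens as delimiters. In a Solr schema, that means using solr.ClassicTokenizerFactory as the tokenizer in place of solr.StandardTokenizerFactory.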