I'm trying to get Solr to extract only the second, seven-digit portion of a ticket number formatted like n-nnnnnnn.
Originally I hoped to keep the full ticket together. According to the documentation, numbers containing a hyphen should be kept together, but after hammering away at this problem for some time and looking at the code, I don't think that's the case: Solr always generates two terms. So rather than getting large numbers of matches on the single leading digit of n-, I'm thinking I can get better query results from just the second portion. Substituting an A for the dash:
<charFilter class="solr.PatternReplaceCharFilterFactory"
pattern="\b\d[A](\d\d\d\d\d\d\d)\b" replacement="$1" replace="all"
maxBlockChars="20000"/>
will parse 1A1234567 fine. But
-\b" replacement="$1" replace="all" maxBlockChars="20000"/>
will not parse 1-1234567.
So it looks like the problem is just the hyphen. I've tried - (escaped), [-], \u002D, \x{45}, and \x045 without success.
I've tried putting char filters around it:
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
<charFilter class="solr.PatternReplaceCharFilterFactory"
pattern="\b\d[-](\d\d\d\d\d\d\d)\b" replacement="$1" replace="all" maxBlockChars="20000"/>
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping2.txt"/>
with mappings:
"-" => "z"
and then
"z" => "-"
It looks like the hyphen is eaten up in the Flex tokenization and isn't even available to the char filter.
Has anyone had more success with hyphen/dash in Solr/Lucene? Thanks
If your Solr is using a recent Lucene (3.1 or later, I think), you will want to use ClassicAnalyzer rather than StandardAnalyzer, since StandardAnalyzer now always treats hyphens as delimiters.
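In a Solr schema, that switch would probably look something like the sketch below. The field type name text_ticket is made up for illustration, and the lowercase filter is just a common default rather than anything this problem requires. As I understand it, the classic tokenizer does not split a hyphenated token that contains a digit, so 1-1234567 should come through as a single term, but it is worth confirming on the Analysis screen of the Solr admin UI for your version.

```xml
<!-- Hypothetical field type; the name "text_ticket" is an assumption, not from the question. -->
<fieldType name="text_ticket" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- ClassicTokenizer keeps the older StandardTokenizer grammar,
         which treats hyphenated tokens containing digits as one token. -->
    <tokenizer class="solr.ClassicTokenizerFactory"/>
    <!-- Optional; included only because most text fields lowercase their terms. -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

If the whole ticket survives tokenization this way, the PatternReplaceCharFilterFactory workaround and the two mapping files may not be needed at all. You would need to reindex after changing the field type.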