I have documents in Solr/Lucene (3.x) with a special copy field facet_headline in order to have an unstemmed field for faceting.
Sometimes two or more words belong together and should be handled/counted as one word, for example "kim jong il".
So the headline "Saturday: kim jong il had died" should be split into:
Saturday
kim jong il
had
died
For this reason I decided to use protected words (protwords), where I add kim jong il.
The schema.xml looks like this:
<fieldType name="facet_headline" class="solr.TextField" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.PatternTokenizerFactory" pattern="\?|\!|\.|\:|\;|\,|\&quot;|\(|\)|\\|\+|\*|&lt;|&gt;|([0-31]+\.)" />
    <filter class="solr.WordDelimiterFilterFactory" splitOnNumerics="0"
            protected="protwords.txt" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"
    />
  </analyzer>
</fieldType>
Using the Solr analysis page, it looks like that doesn't work!
The string is still split into 6 words. It looks like protwords.txt is not used, but if the headline contains ONLY the name kim jong il, everything works fine: the terms aren't split.
Is there a way to reach my goal: not to split specific words/word groups?
Here's what I think is happening.
WordDelimiterFilterFactory is a token filter, so its job is to add, remove or change already generated tokens (in this case, to split words into sub-words based on case transitions, hyphens, etc.), not to split documents into words, which is a job for the tokenizer (in this case, PatternTokenizerFactory). It seems that your tokenizer pattern is missing a \s, so it's not splitting on whitespace and WordDelimiterFilterFactory is getting whole phrases.
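In your example, WordDelimiterFilterFactory would be getting a phrase such as kim jong il had died as a single token (only the colon after Saturday is matched by your pattern) and, as it doesn't match any of your protected words, it proceeds to split this "word" into sub-words (whitespace is a non-alphanumeric character, so the word qualifies for splitting). That also explains why a headline containing only kim jong il stays intact: the whole token matches the protected entry.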
So here's a possible solution. Add a \s to your tokenizer pattern and then use KeywordMarkerFilterFactory to protect your words. Something like this:
<fieldType name="facet_headline" class="solr.TextField" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.PatternTokenizerFactory" pattern="\s|\?|\!|\.|\:|\;|\,|\&quot;|\(|\)|\\|\+|\*|&lt;|&gt;|([0-31]+\.)" />
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"
            ignoreCase="false"/>
    <filter class="solr.WordDelimiterFilterFactory" splitOnNumerics="0" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"
    />
  </analyzer>
</fieldType>
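Update: OK, now that I've double-checked the documentation, this proposed solution is not likely to work for you. Once the tokenizer splits on whitespace, kim, jong and il arrive as separate tokens, and a protected-words list is matched against single tokens, so it cannot rejoin a phrase. I would focus on experimenting with SynonymFilterFactory. Check this message in the solr-user mailing list. It's a bit outdated, but gives some insight into the problem.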
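To make the SynonymFilterFactory idea a bit more concrete, here is a minimal, untested sketch of what such an experiment might look like. It assumes a hypothetical synonyms file named facet_synonyms.txt (not part of your setup) that maps the three-token sequence to a single underscore-joined token, and it leaves out WordDelimiterFilterFactory because that filter would presumably split kim_jong_il again on the underscores. Treat it as a starting point for the analysis page, not as a confirmed solution.

```xml
<!-- Assumed content of the hypothetical facet_synonyms.txt, one rule per line:
     kim jong il => kim_jong_il
-->
<fieldType name="facet_headline" class="solr.TextField" omitNorms="true">
  <analyzer>
    <!-- \s added so the tokenizer also splits on whitespace -->
    <tokenizer class="solr.PatternTokenizerFactory" pattern="\s|\?|\!|\.|\:|\;|\,|\&quot;|\(|\)|\\|\+|\*|&lt;|&gt;|([0-31]+\.)" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Intended to replace the token sequence kim, jong, il with the single token kim_jong_il -->
    <filter class="solr.SynonymFilterFactory"
            synonyms="facet_synonyms.txt"
            ignoreCase="true"
            expand="false"
    />
    <!-- Stop words are removed after the synonym step so they cannot break up a phrase first -->
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"
    />
  </analyzer>
</fieldType>
```

If this behaves as hoped, the facet values would contain kim_jong_il rather than kim jong il, so the underscores would have to be turned back into spaces when displaying the facets.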
After searching the web, I came to the conclusion that it's not possible to reach this goal.
It looks like this is not what the tokenizers and filters are designed for.