Stop words in Sitecore

Posted 2019-04-06 10:49

Question:

We are using Lucene for text search as part of Sitecore. Is there a way to ignore stop words (such as a, an, the, ...) in Sitecore search?

Answer 1:

By default, Sitecore uses the Lucene standard analyzer, Lucene.Net.Analysis.Standard.StandardAnalyzer. You can see that it is defined in the /configuration/sitecore/search/analyzer element of the web.config file. One of the constructors of the StandardAnalyzer class accepts an array of strings that it will treat as stop words. By default it uses a hardcoded list of stop words, which includes:

"a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the", "their", "then", "there", "these", "they", "this", "to", "was", "will", "with"

If you'd like to override this behavior, I think you should inherit from StandardAnalyzer and give your class a parameterless constructor that passes its own stop words to one of the base constructors, instead of relying on the hardcoded array. (Constructors cannot be overridden in C#, so the derived class has to call the base constructor it needs.) The stop words can come from various sources, even a text file. Don't forget to replace the standard class with yours in web.config.

See other constructors of StandardAnalyzer class for more details. .NET Reflector is your friend here.
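
The text-file option mentioned above could look roughly like the sketch below. It is not taken from Sitecore or Lucene.Net documentation. It assumes the Lucene.Net 2.9 base constructor that takes a Version and a Hashtable of stop words. The class name FileStopWordAnalyzer, the LoadStopWords helper and the file path are hypothetical, and the file is assumed to hold one stop word per line.

```csharp
using System.Collections;
using System.IO;

// Hypothetical analyzer that reads its stop words from a text file.
public class FileStopWordAnalyzer : Lucene.Net.Analysis.Standard.StandardAnalyzer
{
    // Assumed location of the stop word file; adjust to your installation.
    private const string StopWordFile = @"C:\inetpub\wwwroot\App_Data\stopwords.txt";

    // Sitecore needs a parameterless constructor to create the analyzer
    // from the type name given in web.config.
    public FileStopWordAnalyzer()
        : base(Lucene.Net.Util.Version.LUCENE_29, LoadStopWords(StopWordFile))
    {
    }

    // Builds the stop word table: one lowercased word per non-empty line,
    // stored as both key and value.
    private static Hashtable LoadStopWords(string path)
    {
        var table = new Hashtable();

        // If the file is missing, the table stays empty and no words are filtered.
        if (!File.Exists(path))
        {
            return table;
        }

        foreach (string line in File.ReadAllLines(path))
        {
            string word = line.Trim().ToLowerInvariant();
            if (word.Length > 0)
            {
                table[word] = word;
            }
        }

        return table;
    }
}
```

The words are lowercased because StandardAnalyzer lowercases tokens before it filters stop words. Changing the file only affects content indexed afterwards, so existing indexes would probably need a rebuild.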



Answer 2:

An example for Yan's post:

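using System.Collections;

public class CaseAnalyzer : Lucene.Net.Analysis.Standard.StandardAnalyzer
{
   // Empty table: no word is treated as a stop word.
   // Initializing it with {{"by","by"}} would make "by" a stop word that the analyzer drops and never matches.
   private static readonly Hashtable stopWords = new Hashtable();

   // Pass the custom stop word table to the base analyzer instead of the hardcoded default list.
   public CaseAnalyzer() : base(Lucene.Net.Util.Version.LUCENE_29, stopWords)
   {
   }
}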

This should be registered in web.config under

/configuration/sitecore/search/analyzer

An example of the analyzer registration:

<caseanalyzer type="EBF.Business.Search.Analyzers.CaseAnalyzer, EBF.Business, Version=1.0.0.0, Culture=neutral"/>

Lastly, you just need to reference your analyzer in the search configuration like this:

<Analyzer ref="search/caseanalyzer" />