Remove email addresses from Solr indexing

Posted 2019-09-07 17:47

Question:

When Solr builds the index, it indexes parts of email addresses.

For example, if I have an email address like foo@bar.com, Solr indexes the words "foo" and "barcom".

I want to remove these words, but I don't know how. I tried modifying the configuration file schema.xml by adding this rule to my indexed field:

<filter class="solr.PatternReplaceFilterFactory" pattern=" (.*)@(.*) " replacement=" " replace="all"/>

However, it doesn't work.

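Answer 1:

You can detect tokens that are e-mail addresses and blacklist them. UAX29URLEmailTokenizerFactory keeps each address as a single token of type <EMAIL>, so list <EMAIL> on its own line in email_type.txt and use: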

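<fieldType name="emails" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <!-- Emits each e-mail address as one token of type <EMAIL> -->
    <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
    <!-- useWhitelist="false" drops the token types listed in email_type.txt and keeps the rest -->
    <filter class="solr.TypeTokenFilterFactory" types="email_type.txt" useWhitelist="false"/>
  </analyzer>
</fieldType>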
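
Token filters such as PatternReplaceFilterFactory only see individual tokens after the tokenizer has run, which is probably why the pattern in the question never matches: by then the address has already been split at the @. If you would rather keep a regex-based approach, a char filter runs on the raw text before tokenization. The following is only a sketch under assumptions: the field type name text_no_emails is hypothetical, and the pattern \S+@\S+ is a deliberately crude stand-in for "anything that looks like an address", so it will also strip other whitespace-delimited strings containing an @.

```xml
<!-- Hypothetical field type; the name text_no_emails is an assumption -->
<fieldType name="text_no_emails" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Runs before the tokenizer: replaces any run of non-space characters containing @ with a space -->
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\S+@\S+" replacement=" "/>
    <!-- The tokenizer now never sees the address, so neither "foo" nor "barcom" is indexed -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Either way, only the indexed terms change. The stored value of the field still contains the original address, and documents have to be reindexed after the schema change.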