Solr: Strip HTML when highlighting with stored HTML

Posted 2019-07-11 07:53

Question:

I am using Solr and Sunspot in Rails.

I am searching on an HTML field using a field type like this:

<fieldType name="text_html" class="solr.TextField" omitNorms="false">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

I then run a search and use a stored field so that I can return highlighted text in the results. The problem is that the stored value still contains the original HTML. For example, a search on 'news' returns:

"community connection to @@@hl@@@news@@@endhl@@@, sports, local deals and all the latest conversations.</div>\n</div>\n</div>"

I then want to replace the @@@hl@@@ and @@@endhl@@@ markers with HTML tags.

Do I need to strip out the original HTML tags (divs, etc.) myself, or is there a way to get the stored value with the HTML tags already stripped out?

I know how to do this manually; I just wanted to make sure I wasn't missing something in schema.xml or solrconfig.xml.

Thanks

Answer 1:

You will need to strip that markup out yourself, either before sending the data to Solr or after retrieving it from the index. The analyzers, tokenizers, and token filters in Solr operate only on the token stream: at index time they turn the value you pass in into the terms that go into the index for that document, and at query time they process the query text. The stored field value, which is what comes back with query results and what the highlighter works from, is always kept in the original form you passed in.
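One way to do the stripping "before inserting" inside Solr itself is an update request processor, which runs before the document is stored and indexed. The following is only a sketch for solrconfig.xml, assuming Solr 4.0 or later (where `HTMLStripFieldUpdateProcessorFactory` is available); the chain name `strip-html` and the field name `body_html` are hypothetical and would need to match your own setup.

```xml
<!-- solrconfig.xml: sketch only; chain name and field name are hypothetical -->
<updateRequestProcessorChain name="strip-html">
  <!-- Removes HTML markup from the incoming value, so the stored value is already stripped -->
  <processor class="solr.HTMLStripFieldUpdateProcessorFactory">
    <str name="fieldName">body_html</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <!-- RunUpdateProcessorFactory must come last; it performs the actual update -->
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

The chain only applies to updates that select it, for example through the `update.chain` request parameter. Note that this changes what is stored, so the original HTML is no longer retrievable from that field.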

If you happen to be using the DataImportHandler to load your data into Solr, it provides an HTMLStripTransformer and a RegexTransformer, either of which you could use to remove the HTML tags.
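As a rough illustration of the DataImportHandler route, a data-config.xml could look something like this. The JDBC driver, connection URL, table, and column names are all made up for the example; the relevant parts are `transformer="HTMLStripTransformer"` on the entity and `stripHTML="true"` on the field that should be cleaned.

```xml
<!-- data-config.xml: sketch only; data source, query, and column names are hypothetical -->
<dataConfig>
  <dataSource type="JdbcDataSource"
              driver="org.postgresql.Driver"
              url="jdbc:postgresql://localhost/example_db"/>
  <document>
    <!-- The transformer is enabled per entity -->
    <entity name="article"
            query="SELECT id, body FROM articles"
            transformer="HTMLStripTransformer">
      <field column="id"/>
      <!-- stripHTML="true" marks the column whose HTML should be removed before indexing and storing -->
      <field column="body" stripHTML="true"/>
    </entity>
  </document>
</dataConfig>
```

Because the transformer runs on the row before the document is built, both the indexed terms and the stored value for `body` would be free of tags, and the highlighter would return clean text.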



Answer 2:

For my project I also needed to strip HTML tags before indexing, and a Google search brought me here first. After a short visit to the docs linked to by Paige Cook, I spotted where the problem with your schema.xml might be.

According to the Solr documentation, <charFilter> elements must come before the <tokenizer> element.

So I think you should have something like this:

<fieldType name="text_html" class="solr.TextField" omitNorms="false">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>