Search in solr with special characters

I have a problem with a search with special characters in solr. My document has a field "title" and sometimes it can be like "Titanic - 1999" (it has the character "-"). When i try to search in solr with "-" i receive a 400 error. I've tried to escape the character, so I tried something like "-" and "\-". With that changes solr doesn't response me with an error, but it returns 0 results.

How can i search in the solr admin with that special character(something like "-" or "'"???

Regards

UPDATE Here you can see my current solr scheme https://gist.github.com/cpalomaresbazuca/6269375

My search is to the field "Title".

excerpt from the schema.xml:

 ...
 <!-- A general text field that has reasonable, generic
     cross-language defaults: it tokenizes with StandardTokenizer,
     removes stop words from case-insensitive "stopwords.txt"
     (empty by default), and down cases.  At query time only, it
     also applies synonyms. -->
    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
        <analyzer type="index">
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
            <!-- in this example, we will only use synonyms at query time
             <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
             -->
            <filter class="solr.LowerCaseFilterFactory"/>

        </analyzer>
        <analyzer type="query">
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
            <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
            <filter class="solr.LowerCaseFilterFactory"/>

        </analyzer>
    </fieldType>
...
<field name="Title" type="text_general" indexed="true" stored="true"/>

标签： search solr lucene full-text-search special-characters

3条回答

傲

2楼-- · 2019-04-26 07:26

I spent a lot of time getting this done. Here is a clear step-by-step things to be done to query special characters in SolR. Hope it helps someone.

Edit the schema.xml file and find the solr.TextField that you are using.

Under both, "index" and query" analyzers modify the WordDelimiterFilterFactory and add types="characters.txt" Something like:

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
 <analyzer type="index">
 <tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter catenateAll="0" catenateNumbers="0" catenateWords="0" class="solr.WordDelimiterFilterFactory" generateNumberParts="1" generateWordParts="1" splitOnCaseChange="1" types="characters.txt"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter catenateAll="0" catenateNumbers="0" catenateWords="0" class="solr.WordDelimiterFilterFactory" generateNumberParts="1" generateWordParts="1" splitOnCaseChange="1" types="characters.txt"/>
</analyzer>
</fieldType>

Ensure that you use WhitespaceTokenizerFactory as the tokenizer as shown above.

Your characters.txt file can have entries like-

 \# => ALPHA
@ => ALPHA
\u0023 => ALPHA
                ie:- pointing to ALPHA only.

Clear the data, re-index and query for the entered characters. It will work.

0人赞添加讨论(0) 举报

走好不送

3楼-- · 2019-04-26 07:33

You are using the standard text_general field for the title attribute. This might not be a good choice. text_general is meant to be for huge chunks of text (or at least sentences) and not so much for exact matching of names or titles.

The problem here is that text_general uses the StandardTokenizerFactory.

 <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
        <analyzer type="index">
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
            <!-- in this example, we will only use synonyms at query time
             <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
             -->
            <filter class="solr.LowerCaseFilterFactory"/>

        </analyzer>
        <analyzer type="query">
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
            <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
            <filter class="solr.LowerCaseFilterFactory"/>

        </analyzer>
    </fieldType>

StandardTokenizerFactory does the following:

A good general purpose tokenizer that strips many extraneous characters and sets token types to meaningful values. Token types are only useful for subsequent token filters that are type-aware of the same token types.

This means the '-' character will be completely ignored and be used to tokenize the String.

"kong-fu" will be represented as "kong" and "fu". The '-' disappears.

This does also explain why select?q=title:\- won't work here.

Choose a better fitting field type:

Instead of the StandardTokenizerFactory you could use the solr.WhitespaceTokenizerFactory, that only splits on whitespace for exact matching of words. So making your own field type for the title attribute would be a solution.

Solr also has a mininal fieldtype called text_ws. Depending on your requirements this might be enough.

0人赞添加讨论(0) 举报

兄弟一词,经得起流年.

4楼-- · 2019-04-26 07:43

To search for your exact phrase put inverted commas round it:

select?q=title:"Titanic - 1999"

If you just want to search for that special character then you will need to escape it:

select?q=title:\-

Also check: Special characters (-&+, etc) not working in SOLR Query

If you know exactly which special characters you dont want to use then you can add this to the regex-normalize.xml

<regex> 
  <pattern>&#x2D;</pattern> 
  <substitution>%2D</substitution> 
</regex>

This will replace all "-" with %2D, so when you search, as long as you search for %2D instead of the "-" it will work fine

0人赞添加讨论(0) 举报

Search in solr with special characters

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间