I've got strange results when I have special characters in my query.
Here is my request :
q=histoire-france&start=0&rows=10&sort=score+desc&defType=dismax&qf=any^1.0&mm=100%
Parsed query :
<str name="parsedquery_toString">+((any:histoir any:franc)) ()</str>
I've got 17000 results because Solr is doing an OR (should be AND).
I have no problem when I'm using a whitespace instead of a special char :
q=histoire france&start=0&rows=10&sort=score+desc&defType=dismax&qf=any^1.0&mm=100%
<str name="parsedquery_toString">+(((any:histoir) (any:franc))~2) ()</str>
2000 results for this query.
Here is my schema.xml (relevant parts) :
<fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.CommonGramsFilterFactory" words="stopwords_french.txt" ignoreCase="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_french.txt" enablePositionIncrements="true"/>
<filter class="solr.SnowballPorterFilterFactory" language="French" protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<!--<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>-->
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" preserveOriginal="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.CommonGramsFilterFactory" words="stopwords_french.txt" ignoreCase="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_french.txt" enablePositionIncrements="true"/>
<filter class="solr.SnowballPorterFilterFactory" language="French" protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
</analyzer>
</fieldType>
I even tried with a PatternTokenizerFactory to tokenize on whitespaces & special chars but no change...
My current workaround is to replace all special chars by whitespaces before sending query to Solr, but it is not satisfying.
EDIT : Even with a charFilter (PatternReplaceCharFilterFactory) to replace special characters by whitespace, it doesn't work...
First line of analysis via solr admin, with verbose output, for query = 'histoire-france' :
org.apache.solr.analysis.PatternReplaceCharFilterFactory {replacement= , pattern=([,;./\\'&-]), luceneMatchVersion=LUCENE_32}
text histoire france
The '-' is replaced by ' ', then tokenized by WhitespaceTokenizerFactory. However I still have different number of results for 'histoire-france' and 'histoire france'.
Did i miss something ?
It was a bug : https://issues.apache.org/jira/browse/SOLR-3589
It is fixed in Solr 4.1 (22 January 2013)
You get different number of results searching for 'histoire-france' and 'histoire france' because query parser creates a phrase query in the first case, and a boolean query (separate two words) in the second case.
This is not obvious behavior imho, but i believe it's hard to satisfy all use cases.
To make search treating 'histoire-france' as simply two words you can add "solr.PositionFilterFactory" to the end of query analyzer like:
Then search results for 'histoire-france' and 'histoire france' will be equal.
Note that position filter can be undesired for phrase searches (both 'historie' and 'france' to be present). Consider using of query slops parameter qs > 0 instead in case you have modified term sequence with say NGram filter.
Enable the autoGeneratePhraseQueries to true and this would generate the phrase queries.
So when searched for histoire-franc, it would generate a query with quotes which will enable only the documents having both words as a phrase being matched.
Example working configuration -
Use query slop to specify the number of slops e.g.
qs=10
in a phrase query.using
WhitespaceTokenizerFactory
, Solr will split your query string into words.But, after tokenizing you(Solr) split your word (again) into terms using solr.WordDelimiterFilterFactory. Look at the documentation and look at the Wi-Fi example.
That could be one reason, why
histoire france
andhistoire-france
are handled different.2nd: don't forget, that the DSIMAX handles (normally) the query-term as "term" and also (additional) as parsed string again.
To solve your problem, you could try to avoid the world delimiter and try to handle "tokenizing" by using
PatternTokenizerFactory
(as you tried before, but now without WordDelimiterFilterFactory).If that doesn't work, try to post the complete output of the analysys.jsp