Lowercase filter factory doesn't work when doc

Posted 2019-07-21 00:42

Question:

I am trying to achieve case insensitive sorting using Solr and faced this issue.

[Copied]

....But When I get search result its not sorted case insensitive. It gives all camel case result first and then all lower case

If I m having short names

Banu

Ajay

anil

sudhir

Nilesh

It sorts like Ajay, Banu, Nilesh, anil, sudhir
...................
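
As an illustration of the ordering described above (a minimal Python sketch, not Solr code): a plain string comparison orders by code point, and every uppercase ASCII letter precedes every lowercase one, so the capitalized names come first. Comparing on a lowercased key gives the case-insensitive order instead.

```python
# Names taken from the example above.
names = ["Banu", "Ajay", "anil", "sudhir", "Nilesh"]

# Plain comparison is by code point: 'A'..'Z' (65-90) sort before 'a'..'z' (97-122).
case_sensitive = sorted(names)

# Comparing on a lowercased key ignores case.
case_insensitive = sorted(names, key=str.lower)

print(case_sensitive)    # ['Ajay', 'Banu', 'Nilesh', 'anil', 'sudhir']
print(case_insensitive)  # ['Ajay', 'anil', 'Banu', 'Nilesh', 'sudhir']
```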

I followed the solution and made the following changes in my Solr schema.xml file (only the relevant field and field type are shown):

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<schema name="autoSolrSchema" version="1.5">
	<types>
		...............
		<fieldType class="org.apache.solr.schema.TextField" name="TextField">
			<analyzer>
				<tokenizer class="solr.KeywordTokenizerFactory"/>
				<filter class="solr.LowerCaseFilterFactory"/>
			</analyzer>
		</fieldType>
		.............
	</types>
	<fields>
	.................
		<field indexed="true" multiValued="false" name="name" stored="true" type="TextField" docValues="true" />
	................	
	</fields>
	<uniqueKey>id</uniqueKey>
	</schema>

But that didn't solve the sorting issue. So I removed docValues="true" from the field definition and tried again. This time sorting worked fine, but I had to specify useFieldCache=true in the query.

Why is solr.LowerCaseFilterFactory not working with docValues="true"?

Is there any other way to make case-insensitive sorting work without removing docValues="true" and specifying useFieldCache=true?

Update:

I followed ericLavault's advice and implemented an Update Request Processor. But now I am facing the following issues:

1) We are using DSE Search, so I followed the method specified in this article.

Our current table schema:

CREATE TABLE IF NOT EXISTS test_data(
    id      UUID,
    nm      TEXT,
    PRIMARY KEY (id)
);

Solr schema:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<schema name="autoSolrSchema" version="1.5">
	<types>
		<fieldType class="org.apache.solr.schema.UUIDField" name="UUIDField"/>
		<fieldType class="org.apache.solr.schema.StrField" name="StrField"/>
	</types>
	<fields>
		<field indexed="true" multiValued="false" name="nm" stored="true" type="StrField" docValues="true"/>
		<field indexed="true" multiValued="false" name="id" stored="true" type="UUIDField"/>
		<field indexed="true" multiValued="false" name="nm_s" stored="true" type="StrField" docValues="true"/>
	</fields>
	<uniqueKey>id</uniqueKey>
</schema>

As advised, I converted nm to lowercase and inserted it as nm_s using an update request processor. Then I reloaded the schema and reindexed. But when querying with select nm from test_data where solr_query='{"q": "(-nm:(sssss))" ,"paging":"driver","sort":"nm_s asc"}';

I am getting the following error:

...enable docvalues true n reindex or place useFieldCache=true...

2) How can I ensure that the value nm_s is properly updated? Is there any way to see the value of nm_s?

3) Why am I getting the above mentioned error even if docValues is enabled?

Answer 1:

This issue probably comes from the fact that DocValues was originally designed to support unanalyzed types. It does not support TextField:

DocValues are only available for specific field types. The types chosen determine the underlying Lucene docValue type that will be used. The available Solr field types are:

  • StrField and UUIDField:
    • If the field is single-valued (i.e., multi-valued is false), Lucene will use the SORTED type.
    • If the field is multi-valued, Lucene will use the SORTED_SET type.
  • Any Trie* numeric fields, date fields and EnumField.
    • If the field is single-valued (i.e., multi-valued is false), Lucene will use the NUMERIC type.
    • If the field is multi-valued, Lucene will use the SORTED_SET type.

(quoted from https://cwiki.apache.org/confluence/display/solr/DocValues)

There is an issue on the Solr Jira to add docValues support for TextField (SOLR-8362), but it is still open and unassigned.


To make case-insensitive sorting work without removing docValues="true", you will have to use a string field type (solr.StrField). Since you can't define an <analyzer> on a string type, you will need an Update Request Processor to lowercase the input stream (or an equivalent, such as preprocessing the field content before sending data to Solr).
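
The client-side preprocessing option mentioned above could look roughly like the following Python sketch. It assumes the field names nm and nm_s from the question's schema. The helper name with_sort_key is hypothetical, and the code only prepares the documents; it does not send them to Solr.

```python
def with_sort_key(doc, source="nm", target="nm_s"):
    """Return a copy of doc with a lowercased copy of `source` stored in `target`.

    The default field names come from the question's schema; the helper
    itself is hypothetical and not part of any Solr client library.
    """
    enriched = dict(doc)
    value = doc.get(source)
    if value is not None:
        enriched[target] = value.lower()
    return enriched


docs = [{"id": "1", "nm": "Banu"}, {"id": "2", "nm": "anil"}, {"id": "3", "nm": "Ajay"}]
prepared = [with_sort_key(d) for d in docs]

# Sorting on the lowercased field is a plain string sort, yet it ignores case.
ordered = [d["nm"] for d in sorted(prepared, key=lambda d: d["nm_s"])]
print(ordered)  # ['Ajay', 'anil', 'Banu']
```

Because nm_s already holds lowercased values, a StrField with docValues="true" can sort on it directly, while nm keeps the original casing for display.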

If you want your field to be tokenized for search and also sorted using DocValues, you may use a copyField from your actual text field (without DocValues) to a string field to sort on (lowercased and with DocValues enabled).