How does Solr's schema-less feature work?

I just found out that Solr 5 doesn't require a predefined schema file and instead generates the schema from the documents being indexed. How does this work in the background?

Is it good practice or not? And is there any way to disable it?

Answer 1:

The schemaless feature has been in Solr since version 4.3, but it may only have become stable more recently, since a concurrency issue with it was fixed in 4.10.

It is also called managed schema. When you configure Solr to use a managed schema, Solr uses a special UpdateRequestProcessor to intercept document indexing requests and guess the field types.
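To make the idea of "guessing field types" concrete, here is a minimal sketch in Python. It is not Solr's code (Solr is written in Java and does this inside its update processor chain); the function name, the type labels and the single date format are all assumptions chosen for illustration. It only shows the general approach of trying the most specific parsers first and falling back to a string.

```python
from datetime import datetime


def guess_field_type(value: str) -> str:
    """Return an illustrative type label for a raw string value.

    Hypothetical helper: the labels are placeholders, not Solr field type names.
    """
    text = value.strip()

    # Booleans first, because "true"/"false" would otherwise end up as strings.
    if text.lower() in ("true", "false"):
        return "boolean"

    # Whole numbers before floating point numbers, since every int parses as a float.
    try:
        int(text)
        return "long"
    except ValueError:
        pass

    try:
        float(text)
        return "double"
    except ValueError:
        pass

    # Assumption: only one ISO-8601-style UTC format is recognised here.
    try:
        datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ")
        return "date"
    except ValueError:
        pass

    # Nothing more specific matched.
    return "string"
```

The order of the checks matters: a value such as "42" would also parse as a floating point number, so the narrower type has to be tried first.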

Solr starts with your schema.xml file and creates a new file, called managed-schema by default, to store all the inferred schema information. Solr automatically overwrites this file whenever it detects changes to the schema.

Because the file is overwritten, you should use the Schema API rather than editing it by hand if you want to make changes to the schema. See also the Schemaless Mode documentation.
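As a rough sketch of what a Schema API call can look like, the Python snippet below builds an add-field request. The core name "mycore" is taken from the steps further down; the field name "title", the field type "string" and the helper name are assumptions, and the field type must already exist in your schema. The request is only built here, not sent, so the snippet runs without a Solr instance.

```python
import json
import urllib.request


def build_add_field_request(base_url, core, name, field_type, stored=True):
    """Build (but do not send) a Schema API request that adds one field.

    Hypothetical helper; field_type must name a field type defined in the schema.
    """
    url = f"{base_url}/{core}/schema"
    payload = {"add-field": {"name": name, "type": field_type, "stored": stored}}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed values: a local Solr on the default port and a core called "mycore".
req = build_add_field_request("http://localhost:8983/solr", "mycore", "title", "string")

# To send it against a running Solr instance, uncomment:
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode("utf-8"))
```

Keeping the request construction separate from the network call makes it easy to inspect the JSON body before anything touches the schema.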

How to change Solr managed schema to classic schema

Stop Solr: $ bin/solr stop

Go to server/solr/mycore/conf, where "mycore" is the name of your core/collection.

Edit solrconfig.xml:

Search for <schemaFactory class="ManagedIndexSchemaFactory"> and comment out the whole element.
Search for <schemaFactory class="ClassicIndexSchemaFactory"/> and uncomment it.
Search for the <initParams> element that refers to add-unknown-fields-to-the-schema and comment out the whole <initParams>...</initParams> element.

Rename managed-schema to schema.xml and you are done.

You can now start Solr again: $ bin/solr start. Then go to http://localhost:8983/solr/#/mycore/documents and check that Solr now refuses to index a document containing a new field that is not yet defined in schema.xml.
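Before restarting, it can help to confirm that the edit to solrconfig.xml did what you intended. The Python sketch below lists the schemaFactory elements that are still active, relying on the fact that the standard library parser skips XML comments by default. The function name and the sample config are made up for illustration, and a real solrconfig.xml that uses includes or custom entities may need more care.

```python
import xml.etree.ElementTree as ET


def active_schema_factories(solrconfig_xml: str) -> list:
    """Return the class names of all schemaFactory elements that are not commented out."""
    root = ET.fromstring(solrconfig_xml)  # comments are dropped by the default parser
    return [element.get("class") for element in root.iter("schemaFactory")]


# Hypothetical, heavily shortened config after applying the steps above.
SAMPLE_CONFIG = """<config>
  <!--
  <schemaFactory class="ManagedIndexSchemaFactory">
    <bool name="mutable">true</bool>
    <str name="managedSchemaResourceName">managed-schema</str>
  </schemaFactory>
  -->
  <schemaFactory class="ClassicIndexSchemaFactory"/>
</config>"""

print(active_schema_factories(SAMPLE_CONFIG))
```

If the list contains anything other than ClassicIndexSchemaFactory, one of the elements was not commented out or uncommented as described.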

Is it a good practice? When to use it?

It depends on what you want. If you want to enforce a specific document structure (e.g. to make sure that all docs are "well-formed" according to your definition), then you want to use the classic schema.

If, on the other hand, you don't know up front what the doc structure is, then you might want to use the schema-less feature.

Limits

While it is called schema-less, there are limits to the kinds of structures that you can index. This is true both for Solr and Elasticsearch, by the way. For example, if you first index this doc:

{"name":"John Doe"}

then you will get an error if you try to index a doc like this next:

{"name": {
   "first": "Daniel",
   "second": "Dennett"
   }
}

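That is because in the first case the field "name" was of type string, while in the second case it is an object. Once the type of a field has been inferred from the first document, later documents have to match it.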
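The conflict can be reproduced with a toy model in Python. This is only an illustration of the "first document fixes the type" behaviour described above; the class, its method and the error message are invented here and do not reflect how Solr or Elasticsearch store or check types internally.

```python
class TinySchema:
    """Toy schema that remembers the kind of value first seen for each field."""

    def __init__(self):
        self.fields = {}  # field name -> kind label, e.g. "str" or "object"

    def index(self, doc: dict) -> None:
        # Work out the kind of every value in the incoming document.
        kinds = {
            name: "object" if isinstance(value, dict) else type(value).__name__
            for name, value in doc.items()
        }
        # Validate everything first, so a rejected document changes nothing.
        for name, kind in kinds.items():
            known = self.fields.get(name, kind)
            if known != kind:
                raise ValueError(f"field {name!r} is already {known}, got {kind}")
        # Only now record the fields that were new.
        for name, kind in kinds.items():
            self.fields.setdefault(name, kind)


schema = TinySchema()
schema.index({"name": "John Doe"})  # "name" is now fixed as a string
try:
    schema.index({"name": {"first": "Daniel", "second": "Dennett"}})
except ValueError as error:
    print("rejected:", error)
```

The second document is rejected because "name" was registered as a plain string by the first one.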

If you would like to index structures that go beyond these limitations, then you could use SIREn. It is an open source semi-structured information retrieval engine which is implemented as a plugin for both Solr and Elasticsearch. (Disclaimer: I worked for the company that develops SIREn.)

Answer 2:

This is the so-called schemaless mode in Solr. I don't know the internal details of how it's implemented.