I am new to Apache solr, can someone please explain the meaning of following terms with examples :-
- Solr Core
- Solr Collection
- Logical vs Physical index
- Sharding
I went through various blog posts but i am not able to understand.
I am new to Apache solr, can someone please explain the meaning of following terms with examples :-
I went through various blog posts but i am not able to understand.
The terminology is used a bit haphazardly, so you'll probably find texts that use a few of these terms interchangeably.
Solr core
A core is a named set of documents living on a single server. A server can have many cores. The core can be replicated to other servers (this is "old style" replication when done manually).
Solr Collection
A collection is a set of cores, from one to .. many. It's a logical description of "these cores together form the entire collection". This was introduced with SolrCloud, as that's the first time that Solr handles clustering for you.
Logical vs Physical
A collection is a logical index - it can span many cores. Each core is a physical index (it has the actual index files from Lucene on its disk). You interact with the collection as you'd interact with the core, and all the details of clustering are (usually) hidden from you by Solr (in SolrCloud mode).
Sharding
Since a collection can span many cores, sharding means that the documents that make up a single collection are present in many cores. Each core is a "shard" of the total index. Compare this to replication, where a copy of a core is distributed to many Solr instances (the same documents are present in both cores, while when sharding the documents are just present in one core and you need all cores to have a complete collection).
Sharding is what makes it possible to store more documents than a single server can handle (or keep in memory/cache to respond quickly enough).
SolrCloud (Added by me to make this all come togheter)
Previously (and still, if you're not using SolrCloud mode) sharding and replication were handled manually by the user when querying and configuring Solr. You set up replication to spread the same core across many servers, and you used sharding to make Solr query many Solr instances to get all the required documents. Today you'll usually just use SolrCloud and let Solr abstract away all these details. You'll come across these terms when creating a collection (numShards and replicationFactor) which tells Solr how many cores you want to spread the collection across, and how many servers should hold copies of these cores.
Collection -> Sharded across [1..N] cores, replicated [0..M] times for redundancy and higher query throughput.