We need to create a compound index in the same order as the parameters are being queried. Does this order matter performance-wise at all?
Imagine we have a collection of all humans on earth with an index on sex
(99.9% of the time "male" or "female", but string nontheless (not binary)) and an index on name
.
If we would want to be able to select all people of a certain sex
with a certain name
, e.g. all "male"s named "John", is it better to have a compound index with sex
first or name
first? Why (not)?
Redsandro,
You must consider
Index Cardinality
andSelectivity
.1. Index Cardinality
The index cardinality refers to how many possible values there are for a field. The field
sex
only has two possible values. It has a very low cardinality. Other fields such asnames, usernames, phone numbers, emails
, etc. will have a more unique value for every document in the collection, which is considered high cardinality.Greater Cardinality
The greater the cardinality of a field the more helpful an index will be, because indexes narrow the search space, making it a much smaller set.
If you have an index on
sex
and you are looking for men named John. You would only narrow down the result space by approximately %50 if you indexed bysex
first. Conversely if you indexed byname
, you would immediately narrow down the result set to a minute fraction of users named John, then you would refer to those documents to check the gender.Rule of Thumb
Try to create indexes on
high-cardinality
keys or puthigh-cardinality
keys first in the compound index. You can read more about it in the section on compound indexes in the book:MongoDB The Definitive Guide
2. Selectivity
Also, you want to use indexes selectively and write queries that limit the number of possible documents with the indexed field. To keep it simple, consider the following collection. If your index is
{name:1}
, If you run the query{ name: "John", sex: "male"}
. You will have to scan1
document. Because you allowed MongoDB to be selective.Consider the following collection. If your index is
{sex:1}
, If you run the query{sex: "male", name: "John"}
. You will have to scan4
documents.Imagine the possible differences on a larger data set.
A little explanation of Compound Indexes
It's easy to make the wrong assumption about Compound Indexes. According to the MongoDB Indexes Guide.
When you create a compound index, 1 Index will hold multiple fields. So if we index a collection by
{"sex" : 1, "name" : 1}
, the index will look roughly like:If we index a collection by
{"name" : 1, "sex" : 1}
, the index will look roughly like:Having
{name:1}
as the Prefix will serve you much better in using compound indexes. There is much more that can be read on the topic, I hope this can offer some clarity.I'm going to say I did an experiment on this myself, and found that there seems to be no performance penalty for using the poorly distinguished index key first. (I'm using mongodb 3.4 with wiredtiger, which may be different than mmap). I inserted 250 million documents into a new collection called
items
. Each doc looked like this:"field1"
was always equal to"bob"
."field2"
was equal toi
, so it was completely unique. First I did a search on field2, and it took over a minute to scan 250 million documents. Then I created an index like so:Of course field1 is "bob" on every single document, so the index should have to search a number of items before finding the desired document. However, this was not the result I got.
I did another search on the collection after the index finished creating. This time I got results which I listed below. You'll see that
"totalKeysExamined"
is 1 each time. So perhaps with wired tiger or something they have figured out how to do this better. I have read the wiredtiger actually compresses index prefixes, so that may have something to do with it.db.items.find({field1:"bob",field2:"250888000"}).explain("executionStats")
Then I created an index on
field3
(which has the same value as field 2). Then I searched:db.items.find({field3:"250888000"});
It took the same 4ms as the one with the compound index. I repeated this a number of times with different values for field2 and field3 and got insignificant differences each time. This suggests that with wiredtiger, there is no performance penalty for having poor differentiation on the first field of an index.