Problem
I need to speed up this kind of query:
db.col.find({ a: "foobar", b: { $exists: true} });
Data Distribution
Existence of fields:
- The field a exists in all documents,
- The field b exists only in ~10% of them.
Current Collection Stats:
db.col.count() // 1,050,505
db.col.count({ a : "foobar" }) // 517,967
db.col.count({ a : "foobar", b : { $exists: true} }) // 44,922
db.col.count({ b : { $exists: true} }) // 88,981
Future data growth:
So far, two batches have been loaded (2x around 500,000).
Each month another batch of ~500,000 documents will be added.
The a field is the name of the batch. The newly added documents will have the same distribution of fields (around 10% of the newly loaded documents will have the b field).
My attempts and research
I created a sparse index on {a:1, b:1}, but because a is present in all documents, that doesn't speed it up. That's due to the behaviour of sparse compound indexes in MongoDB. From the docs:
Sparse compound indexes that only contain ascending/descending index keys will index a document as long as the document contains at least one of the keys.
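For reference, the index was created along these lines (a minimal sketch; only the { a: 1, b: 1 } key pattern and the sparse option matter here):

db.col.createIndex({ a: 1, b: 1 }, { sparse: true })
// Because every document contains "a", the sparse option excludes nothing:
// the compound index still gets one entry per document with a:"foobar".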
This is the .explain() of the query above:
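Since the output contains executionStats, it was presumably run with an explicit verbosity argument, roughly:

db.col.find({ a: "foobar", b: { $exists: true } }).explain("allPlansExecution")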
{
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "myCol",
"indexFilterSet" : false,
"parsedQuery" : {
"$and" : [
{
"a" : {
"$eq" : "foobar"
}
},
{
"b" : {
"$exists" : true
}
}
]
},
"winningPlan" : {
"stage" : "KEEP_MUTATIONS",
"inputStage" : {
"stage" : "FETCH",
"filter" : {
"b" : {
"$exists" : true
}
},
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"a" : 1,
"b" : 1
},
"indexName" : "a_1_b_1",
"isMultiKey" : false,
"direction" : "forward",
"indexBounds" : {
"a" : [
"[\"foobar\", \"foobar\"]"
],
"b" : [
"[MinKey, MaxKey]"
]
}
}
}
},
"rejectedPlans" : []
},
"executionStats" : {
"executionSuccess" : true,
"nReturned" : 44922,
"executionTimeMillis" : 208656,
"totalKeysExamined" : 517967,
"totalDocsExamined" : 517967,
"executionStages" : {
"stage" : "KEEP_MUTATIONS",
"nReturned" : 44922,
"executionTimeMillisEstimate" : 180672,
"works" : 550772,
"advanced" : 44922,
"needTime" : 473045,
"needFetch" : 32804,
"saveState" : 41051,
"restoreState" : 41051,
"isEOF" : 1,
"invalidates" : 0,
"inputStage" : {
"stage" : "FETCH",
"filter" : {
"b" : {
"$exists" : true
}
},
"nReturned" : 44922,
"executionTimeMillisEstimate" : 180612,
"works" : 550772,
"advanced" : 44922,
"needTime" : 473045,
"needFetch" : 32804,
"saveState" : 41051,
"restoreState" : 41051,
"isEOF" : 1,
"invalidates" : 0,
"docsExamined" : 517967,
"alreadyHasObj" : 0,
"inputStage" : {
"stage" : "IXSCAN",
"nReturned" : 517967,
"executionTimeMillisEstimate" : 3035,
"works" : 517967,
"advanced" : 517967,
"needTime" : 0,
"needFetch" : 0,
"saveState" : 41051,
"restoreState" : 41051,
"isEOF" : 1,
"invalidates" : 0,
"keyPattern" : {
"a" : 1,
"b" : 1
},
"indexName" : "a_1_b_1",
"isMultiKey" : false,
"direction" : "forward",
"indexBounds" : {
"a" : [
"[\"foobar\", \"foobar\"]"
],
"b" : [
"[MinKey, MaxKey]"
]
},
"keysExamined" : 517967, // INFO: I think that this is too much. These are all documents having a:"foobar"
"dupsTested" : 0,
"dupsDropped" : 0,
"seenInvalidated" : 0,
"matchTested" : 0
}
}
},
"allPlansExecution" : []
},
"serverInfo" : {
"host" : "productive-mongodb-16",
"port" : 27000,
"version" : "3.0.1",
"gitVersion" : "534b5a3f9d10f00cd27737fbcd951032248b5952"
}
}
a exists in all 1,000,000 documents and 520,000 of them have a: "foobar". In the whole collection, 88,000 documents have the b field.
How can I speed up my query (so that the IXSCAN returns only 44k keys instead of 520k)?