Problem

I need to speedup this kind of query:

db.col.find({ a: "foobar", b: { $exists: true} });

Data Distribution

Existence of fields:

The field a exists in all documents,
The field b exists only in ~10% of them.

Current Table Stats:

db.col.count() // 1,050,505
db.col.count({ a : "foobar" }) // 517.967
db.col.count({ a : "foobar", b : { $exists: true} }) // 44.922
db.col.count({ b : { $exists: true} }) // 88.981

Futrue data growth:

So far two batches where loaded (2x around 500,000). Each month another batch of ~500,000 documents will be added. The a field is the name of this batch. Those newly added documents will have the same distribution of fields (around 10% of the newly loaded documents will have the b field)

My tries and research

I created a sparse index on {a:1, b:1} but because a is present in all documents, that doesn't speed it up. Thats because the behaviour of sparse indexes in MongoDB. From the docs:

Sparse compound indexes that only contain ascending/descending index keys will index a document as long as the document contains at least one of the keys.

This is the .explain() of the upper query:

{
    "queryPlanner" : {
        "plannerVersion" : 1,
        "namespace" : "myCol",
        "indexFilterSet" : false,
        "parsedQuery" : {
            "$and" : [ 
                {
                    "a" : {
                        "$eq" : "foobar"
                    }
                }, 
                {
                    "b" : {
                        "$exists" : true
                    }
                }
            ]
        },
        "winningPlan" : {
            "stage" : "KEEP_MUTATIONS",
            "inputStage" : {
                "stage" : "FETCH",
                "filter" : {
                    "b" : {
                        "$exists" : true
                    }
                },
                "inputStage" : {
                    "stage" : "IXSCAN",
                    "keyPattern" : {
                        "a" : 1,
                        "b" : 1
                    },
                    "indexName" : "a_1_b_1",
                    "isMultiKey" : false,
                    "direction" : "forward",
                    "indexBounds" : {
                        "a" : [ 
                            "[\"foobar\", \"foobar\"]"
                        ],
                        "b" : [ 
                            "[MinKey, MaxKey]"
                        ]
                    }
                }
            }
        },
        "rejectedPlans" : []
    },
    "executionStats" : {
        "executionSuccess" : true,
        "nReturned" : 44922,
        "executionTimeMillis" : 208656,
        "totalKeysExamined" : 517967,
        "totalDocsExamined" : 517967,
        "executionStages" : {
            "stage" : "KEEP_MUTATIONS",
            "nReturned" : 44922,
            "executionTimeMillisEstimate" : 180672,
            "works" : 550772,
            "advanced" : 44922,
            "needTime" : 473045,
            "needFetch" : 32804,
            "saveState" : 41051,
            "restoreState" : 41051,
            "isEOF" : 1,
            "invalidates" : 0,
            "inputStage" : {
                "stage" : "FETCH",
                "filter" : {
                    "b" : {
                        "$exists" : true
                    }
                },
                "nReturned" : 44922,
                "executionTimeMillisEstimate" : 180612,
                "works" : 550772,
                "advanced" : 44922,
                "needTime" : 473045,
                "needFetch" : 32804,
                "saveState" : 41051,
                "restoreState" : 41051,
                "isEOF" : 1,
                "invalidates" : 0,
                "docsExamined" : 517967,
                "alreadyHasObj" : 0,
                "inputStage" : {
                    "stage" : "IXSCAN",
                    "nReturned" : 517967,
                    "executionTimeMillisEstimate" : 3035,
                    "works" : 517967,
                    "advanced" : 517967,
                    "needTime" : 0,
                    "needFetch" : 0,
                    "saveState" : 41051,
                    "restoreState" : 41051,
                    "isEOF" : 1,
                    "invalidates" : 0,
                    "keyPattern" : {
                        "a" : 1,
                        "b" : 1
                    },
                    "indexName" : "a_1_b_1",
                    "isMultiKey" : false,
                    "direction" : "forward",
                    "indexBounds" : {
                        "a" : [ 
                            "[\"foobar\", \"foobar\"]"
                        ],
                        "b" : [ 
                            "[MinKey, MaxKey]"
                        ]
                    },
                    "keysExamined" : 517967, // INFO: I think that this is too much. These are all documents having a:"foobar"
                    "dupsTested" : 0,
                    "dupsDropped" : 0,
                    "seenInvalidated" : 0,
                    "matchTested" : 0
                }
            }
        },
        "allPlansExecution" : []
    },
    "serverInfo" : {
        "host" : "productive-mongodb-16",
        "port" : 27000,
        "version" : "3.0.1",
        "gitVersion" : "534b5a3f9d10f00cd27737fbcd951032248b5952"
    }
}

a exists in all 1,000,000 documents and 520,000 of them have a:"foobar". In the whole collection 88,000 having the b field.

How to speedup my query (so that IXSCAN returns only 44k instead of 520k)?

What you do not seem to be understanding here is that $exists cannot "grab" an index in any way, even where sparse. As the documentation itself says:

"If a sparse index would result in an incomplete result set for queries and sort operations, MongoDB will not use that index"

The example given in those pages is an { "$exists": false } query. But the reverse logical condition does not make any difference here.

In order to get the "full benefit" of a "sparse" index then you need to consider the "type" of data it holds and query appropriately.

For numeric, something like:

db.collection.find({ "a": "foobar", "b": { "$gte": -9999, "$lte": 9999 } })

Which uses an index, and the sparse one. Or for text based:

db.collection.find({ "a": "foobar", "b": /.+/ })

Which will also use the sparse index and only look at those where "b" was defined.

For "arrays" then "be careful". As the value being looked at is probably one of the above unless you did this:

db.collection.insert({ "a": 1, "b": [[]] })

Where then this is okay:

db.ab.find({ "a": 1, "b": { "$type": 4 } })

But is not really going to use the "sparse" index either for the same reasons $exists won`t work here.

So you need to understand what the terms mean here are, and "query appropriately" in order to use the index definitions that you create if you expect the maximum performance.

These are clear examples you can test for yourself and see the results are true. I do wish the core documentation was clearer on these points, but I am also aware many have tried to contribute ( and have produced excellent explainations ) but none of these have been included to date.

Guess that is why you are asking here.