Combine full text with other index

2019-02-21 00:08发布

问题:

I have a full text index and an index on a created date.

My query on the date alone returns a nice, small 44 records very quickly (within a second):

> db.oneMillionDocumentsIndexed.count({created: {$lte: ISODate("2016-02-06T15:34:59.019Z")} })
44

However, if I combine this with the text search the query is incredibly slow:

> db.oneMillionDocumentsIndexed.count({
                                created: {$lte: ISODate("2016-02-06T15:34:59.019Z")}, 
                                $text: { $search: "raven" } })

It appears to use both indexes:

{
    "queryPlanner" : {
        "plannerVersion" : 1,
        "namespace" : "test.oneMillionDocumentsIndexed",
        "indexFilterSet" : false,
        "parsedQuery" : {
            "$and" : [
                {
                    "created" : {
                        "$lte" : ISODate("2016-02-06T15:34:59.019Z")
                    }
                },
                {
                    "$text" : {
                        "$search" : "raven",
                        "$language" : ""
                    }
                }
            ]
        },
        "winningPlan" : {
            "stage" : "FETCH",
            "filter" : {
                "created" : {
                    "$lte" : ISODate("2016-02-06T15:34:59.019Z")
                }
            },
            "inputStage" : {
                "stage" : "TEXT",
                "indexPrefix" : {

                },
                "indexName" : "$**_text",
                "parsedTextQuery" : {

                }
            }
        },
        "rejectedPlans" : [ ]
    },
    "serverInfo" : {
        "host" : "Plod",
        "port" : 27017,
        "version" : "3.0.7",
        "gitVersion" : "6ce7cbe8c6b899552dadd907604559806aa2e9bd"
    },
    "ok" : 1
}    

Shouldn't the created date search reduce the number of documents, so speeding up the query?

Whilst the documents aren't tiny, they aren't massive either. Here's an example document:

{
    "_id" : ObjectId("56b612a2b6c13d2bec221d22"),
    "created" : ISODate("2016-02-06T15:34:57.954Z"),
    "adoptability-integer" : 1885631649,
    "impoverisher-double" : 0.78982932576436,
    "auriga-short-string" : "unpunished",
    "pistillate-long-string" : "raven nationalistic supergalaxies shit candidacy vengefulness baghla inharmony breviaries subcoracoid facet numbles Achaian hyksos g¥ᄀtterdï¿¥ï¾ exsecant costliness assertively cufic neurotomy subfebrile reassess eruption calciphobous epithecium adipopectic eruption neurotomy impaste shrugging oxytone depredating abb¥ᄑ unfaithfulness clive amman meteorology dollond del cussed malversation Determinateness wadset busher precedent warder lithest tuberculinize kythera swiping hyperopic installation otosclerosis costly joyance neenah saliently bicepses myograph blackmur. salable radiational copaiva seisure animism franglais chalkboard astride preaortic machinelike criseyde easternmost theological. goloshes amber assertively universalism pterylological abortifacient entrepï¾¢t nordic intricate canvasser unscholastic caria marginal prakritic gal tambur seascouting branchiform vaticide hysteroidal. vario chefoo permanganic solidillu lashings permanganic denatured chartres Nonenergetically pabx coinheritance koulibiaca wrathless unrejoicing kodly confutable Juru changelessness ratite pol lightener pansy portadown unpeg iontophoresis Ruddily overcorrupt rondure midair mobocrat. Rals sind teaser hussism definiteness piperidine septicity procryptic salicaceous catalpa Stingy panegyrise Baddie wodan preoccasioned ndebele sanitizing mulga grantedly selectman dep overscruple mealies subsellia noncompressible lepidoptera nonequilateral vï¿¥ï¾ racemiform carob preaccredit parramatta. piatigorsky unmanifest eulogized bolometric circumnavigating stare. prewitt branchiform canadianizing untinselled crossruf anthozoic del dragrope pronative foulness incessancy sultanate debunker guncotton reindictment uninstalled pieter buying prestwick anguish dicrotism permissible. nonscarcity labialising underswamp nondegradation incubating unwillable dealer Rewinded jaggedness jasmine flatfootedness edgily choregraphic unpenetrating unwhited devotedly thornton irremediably reentry cordilleras inhospitable blenchingly hedgehop. nontribesman semiexhibitionist streetlike outgeneral Spatiality hyacinthides prometheus tingly tenacious Aerologist promonarchy nonsophistical uhuru unsprayable countrywoman proequality schickard. antagonize Cart undocumented heteroplastic cyclostome keratin specification tombless lambie extricating feticide reacceded redwing autokinetic ferias underpart dupr¥ᄑ preexperimental besancon dvm riksm'' unharmonised bradykinetic unforeseeableness ryukyu rootstalk aquarial uredospore kame nondissenting pachyderm southeasterner comminute excitant torturing reasoningly restabilize isotopy emergency boathouses plowmanship decidedness skeptophylaxis kelebe clive furred abuttals variometer indamine wreathe. guymon rubinstein monotriglyph inaction. bedazzle foreordinated proportioner pursy beryl slogging forbearer abirritant concur. nonleprous veriax overservility mirza relitigate richness dipteroi mischarged. inquisitress nav unimpressibility teratoma brilliantined untensing vlaardingen theorbo shostakovich appia maximally fingered ashkenazim soap unpick isocheimenal gingili synonymical interannular patronising knaggiest cleaver lassie interwound osculated unobliging portobello boxer impactive.Bladderwort wish aerothermodynamics lymphadenomata nonfundamental interdiffuse injector chaussure. polyphyletically irishising ayous sinecurist decant carbonized flickeringly stomatitic emily luteotrophin anginous Syllabic permeameter Carthal brachiator farinose justicelike azotized getaway electroencephalographically puglia unconfound appendiceal premedical vassal rubric overhearing Conative heartaching shammer staphylorrhaphy bulgar spilikin phagocytosing adenitis syntypic dissertate collyrium sonless anoxia archil mimosis irreversibly unhabituated scholiast rcs portadown mishima preimport bonavist jointedly aspergillus farinose condemnation chough blanc descanter mephistopheles ongoing unsurgical unclassifiableness namtar corniest disbudding disklike zap wheyface teetotally nonsubmission delian enrober canadian nasi hypermetabolism animadversion Unbantering recompile ineradicable blindly mren Schorlaceous viperous latish unstationed decastylos catalpa beflagged pellicular demark gassendi. macmonnies deserve subsidizer generous reassess colorfully unsummonable clave hderlin borges aechmagoras misbegotten uncontradictory unfelicitous plunderage presynsacral backband amagasaki unsavorily proenzyme ney slipslop unrhythmical Debenture rosy unreprehended sulfuryl outpeep fichtean jellylike anginous foil pixies columella nonsuggestion unwhited icier archbishop masan oireachtas coxcomb pseudosiphonic rubinstein cockerel fidel swingle submembranous despondent sarajevo camshaft inclusiveness reynard deducibility Counselling velveteen whaleback interventricular harquebuses sodomite chunk nondecayed disyllable nonfundamental funnelling pricing neuroanatomical evaporate palisades kamerun. zigzag meteorology agura puerperium misfield annulus sapper franklinton prenotion pyroxylin dustour fluming cereus nontangental metempirical Nonadjudication restated impactive.Bladderwort swingle frolic hadramaut buraydah uncarbonized sthenius uncreditableness undreading grattoir excitant bma mellers centurial broad intellectualist pursy apodemal inclusiveness laurence kentucky cyanic nonunified jason. swiping mismatch cereuses dress entrain mannikin insetting scratchy glaiketness query antipatriarchal rjcharging. fichtean lwm reidentification theurgically Baddie abut snowcreep vaud cretheus clubhouses homodyne rayah beguine coquettishly rabidness retime lithoid send epistyle undefendable christless narcomania extraprofessional paracelsic interrogatories eucrite cotswolds reverberantly recommendatory dorsally wobbly sheared malacca worminess oka railway farnham bendwise prediet bastioned tuberculocele deriver intelligential Cutty Artillerist calipatria torchier drillable currawong obviable remoteness. forte sentimentalist dealer nonempathically foreseeable talthybius reinjuries tannic hyperopic toolmaker pieridine noncontention panne baghla syndromic intermeasuring gait leaving osteoclasis. squillageed cadetship messieurs benet Player terseness chagrinning sterically birthmark subvertebral runesmith stomodaeum illiberalising sarmentose overlubricate weeds ecumenic unretaliative execrableness trichotomic schumann luxury nupe dirk ashkenazim zap iconoclast vulneraries pulaski hypergeusesthesia mismatch lymphangitides cubitus unpossessable rummage silviculturally bara quo. arizonian danielson granddaddy klemperer curling derivatively monadal ungrained counterproductive contendingly handled aegirine motilal. unfaulty anecdotal cyanate. bucolically leaving mephistopheles revibrated maculation glairier palmer harebell Laryngectomized primitivity.Mucous consensual comfortlessly slumberland preenrollment decastylos buying yggdrasil unslakable concordia Uprooter pï¾¢rto meloid. klemperer frambesia bohemia kruller carburettor limousin Accessarily debt ameliorate bootblack richardson salvarsan contumaciousness landscaper epigyny palisades redwing pyribenzamine totalitarian taxiplane aurum chasm criollos transannular friendship spitefully eliot Ennomus spessartine pomiferous Ethnarchy milkiness fractable amigen unexuberant.Repark clapt keyboard noninjurious unemotive corbelled sib sextet beheader appear kyathos cirrocumular semipagan teasingly coelenteron nihilism chitchatted dress bateau unrhythmical unimmerged lapelled archaeocyte depersonalise redispersal querurying memlinc strepitous. consociate dehypnotizing stardom novelise mimosa disklike invertase nonmarriageable agreeing tuberculinize graphologic paris hew airy outwove inconvenienced columella desc freight broodiest spermatorrhoea melodic rebeck. silverised jahangir everard foolishly gabby packer Mahound Emendatory infeasibility inkpot resubstantiated Isopectic revivifying crassulaceous unresigned.Greenboard hanyang guevara inspectable hyperbrachycephaly dicrotal armipotent dissever girdlelike alternator obs. heritable nondietetically sensationism medick chlorellaceous spotted flews mariner gait nontribesman unshrinkability regulated haunter sharer postliminy maeterlinck disaffiliating nonreflection disadvantageously creepy congenitalness puglia savanna. Codetta orb reenlightenment gen palaeozoology educatee niobous deject dysteleological pampre electroencephalographically harebrained execrableness achroite theorbo germinance anisocarpic jagellon antlia frenchiness splendid communalize andalusia unlofty archduchy apery forbade snit wintriest mendicity franglais depersonalise sibship unslapped totalitarian compatriot doll polkaed dyersville huntingdonshire loftily spectrality carafe gouverneur cureless unprecarious redevelop illiberalism. racialistic distributing cameo madrigalesque coalitionist snort cochleate overact ladysmith protostele Afforestation multimegaton proletarianness Amphithalamus abeokuta. Amerind subfreezing missilery secateurs. superstructure. chrysarobin seaworthiness snohomish necrobacillosis incinerated wrack sclaffer kamasutra postmyxedemic mortgagor impaste earliness underlapped bucktooth mortified birthmark unscrupled angiocardiography hemiacetal judgeless hussy channing reunified nondissenting hypercathartic vindicable unslapped extensionally lashings canniest cling motional homotherm overobesity clive retasting clipt rewound. unousted prosper australorp theocracy Interprofessional crocus Carthal unmoveable repouss¥ᄑ birthmark reasonableness wristwatch patronising g¥ᄀtterdï¿¥ï¾ jink vitus. stokes ultima phyllocladium mudfish trust caravaggio overtipple Amorphous Baddie milksopping mulier. indeciduate winkle acrimoniousness. cereuses altgeld gelatinoid contemporaneous traveling haphtaroth aet gogglebox nupe archeptolemus withdrawable Nonenergetically horsing coral. stint preludin keynoter cogitative persuadability godwin wardenry reborn patternless sorrentine vitria moror chumash nonguilt nonpacific realter regive unoratorial halothane skeptophylaxis quo. songfest Desperado mischarged. suberise teratoma apposer homoiothermal nonstyptical "
}

回答1:

The main case here is that a "text" search result is generally taking precedence over other filter conditions in the query, and as such it becomes necessary to "first" obtain results from the "text" component, and then basically "scan" for other conditions in the document.

This type of search can be difficult to optimise along with a "range" or any type of "inequality" match condition in conjuntion with the text search results, and is mostly due to how MongoDB handles this "special" index type.

For a short demonstration, consider the following basic setup:

db.texty.drop();

db.texty.insert([
    { "a": "a", "text": "something" },
    { "a": "b", "text": "something" },
    { "a": "b", "text": "nothing much" },
    { "a": "c", "text": "something" }
])

db.texty.createIndex({ "text": "text" })
db.texty.createIndex({ "a": 1 })

So if you wanted to look at this with a text search condition as well as a range consideration on the other field ( { "$lt": "c" } ), then you could handle as follows:

db.texty.find({ "a": { "$lt": "c" }, "$text": { "$search": "something" } }).explain()

With the explain output such as ( important part ):

           "winningPlan" : {
                    "stage" : "FETCH",
                    "filter" : {
                            "a" : {
                                    "$lt" : "c"
                            }
                    },
                    "inputStage" : {
                            "stage" : "TEXT",
                            "indexPrefix" : {

                            },
                            "indexName" : "text_text",
                            "parsedTextQuery" : {
                                    "terms" : [
                                            "someth"
                                    ],
                                    "negatedTerms" : [ ],
                                    "phrases" : [ ],
                                    "negatedPhrases" : [ ]
                            },
                            "inputStage" : {
                                    "stage" : "TEXT_MATCH",
                                    "inputStage" : {
                                            "stage" : "TEXT_OR",
                                            "inputStage" : {
                                                    "stage" : "IXSCAN",
                                                    "keyPattern" : {
                                                            "_fts" : "text",
                                                            "_ftsx" : 1
                                                    },
                                                    "indexName" : "text_text",
                                                    "isMultiKey" : true,
                                                    "isUnique" : false,
                                                    "isSparse" : false,
                                                    "isPartial" : false,
                                                    "indexVersion" : 1,
                                                    "direction" : "backward",
                                                    "indexBounds" : {

                                                    }
                                            }
                                    }
                            }
                    }
            },

Which is basically saying "first get me the text results and then filter those results fetched by the other condition". So clearly only the "text" index is being used here and then all the results it returns are subsequently being filtered by examining the content.

This is not optimal for two reasons, being that it may likely be that the data is best constrained by the "range" condition rather than the matches from the text search. Secondly, even though there is an index on the other data, it is not being used here for comparison. So rather the whole document is loaded for each result and the filter is tested.

You might then consider a "compound" index format here, and it would seem initially logical that if the "range" is more specific to selection, then include that as the prefixed order of the indexed keys:

db.texty.dropIndexes();
db.texty.createIndex({ "a": 1, "text": "text" })

But there is a catch here, since when you attempt to run the query again:

db.texty.find({ "a": { "$lt": "c" }, "$text": { "$search": "something" } })

It would result in an error:

Error: error: { "waitedMS" : NumberLong(0), "ok" : 0, "errmsg" : "error processing query: ns=test.textyTree: $and\n a $lt \"c\"\n TEXT : query=something, language=english, caseSensitive=0, diacriticSensitive=0, tag=NULL\nSort: {}\nProj: {}\n planner returned error: failed to use text index to satisfy $text query (if text index is compound, are equality predicates given for all prefix fields?)", "code" : 2 }

So even though that may seem "optimal", the way MongoDB processes the query ( and really index selection ) for the special "text" index, it is just not possible for this "exclusion" outside of the range to be possible.

You can however perform an "equality" match on this in a very efficient way:

db.texty.find({ "a": "b", "$text": { "$search": "something" } }).explain()

With the explain output:

           "winningPlan" : {
                    "stage" : "TEXT",
                    "indexPrefix" : {
                            "a" : "b"
                    },
                    "indexName" : "a_1_text_text",
                    "parsedTextQuery" : {
                            "terms" : [
                                    "someth"
                            ],
                            "negatedTerms" : [ ],
                            "phrases" : [ ],
                            "negatedPhrases" : [ ]
                    },
                    "inputStage" : {
                            "stage" : "TEXT_MATCH",
                            "inputStage" : {
                                    "stage" : "TEXT_OR",
                                    "inputStage" : {
                                            "stage" : "IXSCAN",
                                            "keyPattern" : {
                                                    "a" : 1,
                                                    "_fts" : "text",
                                                    "_ftsx" : 1
                                            },
                                            "indexName" : "a_1_text_text",
                                            "isMultiKey" : true,
                                            "isUnique" : false,
                                            "isSparse" : false,
                                            "isPartial" : false,
                                            "indexVersion" : 1,
                                            "direction" : "backward",
                                            "indexBounds" : {

                                            }
                                    }
                            }
                    }
            },

So the index is used and it can be shown to "pre-filter" the content provided to the text matching by the output of the other condition.

If indeed you keep the "prefix" to the index as the "text" field(s) to search however:

db.texty.dropIndexes();

db.texty.createIndex({ "text": "text", "a": 1 })

Then perform the search:

db.texty.find({ "a": { "$lt": "c" }, "$text": { "$search": "something" } }).explain()

Then you see a similar result to the above "equality" match:

            "winningPlan" : {
                    "stage" : "TEXT",
                    "indexPrefix" : {

                    },
                    "indexName" : "text_text_a_1",
                    "parsedTextQuery" : {
                            "terms" : [
                                    "someth"
                            ],
                            "negatedTerms" : [ ],
                            "phrases" : [ ],
                            "negatedPhrases" : [ ]
                    },
                    "inputStage" : {
                            "stage" : "TEXT_MATCH",
                            "inputStage" : {
                                    "stage" : "TEXT_OR",
                                    "filter" : {
                                            "a" : {
                                                    "$lt" : "c"
                                            }
                                    },
                                    "inputStage" : {
                                            "stage" : "IXSCAN",
                                            "keyPattern" : {
                                                    "_fts" : "text",
                                                    "_ftsx" : 1,
                                                    "a" : 1
                                            },
                                            "indexName" : "text_text_a_1",
                                            "isMultiKey" : true,
                                            "isUnique" : false,
                                            "isSparse" : false,
                                            "isPartial" : false,
                                            "indexVersion" : 1,
                                            "direction" : "backward",
                                            "indexBounds" : {

                                            }
                                    }
                            }
                    }
            },

The big differnce here from the first attempt being where filter is placed in the processing chain, indicating that whilst not a "prefix" match ( which is most optimal ), the content is indeed being scanned off of the index "before" being sent to the "text" stage.

So it is "pre-filtered" but not of course in the most optimal way, and this is due to the very nature of how the "text" index is used. So if you just considered the plain range on an index by itself:

db.texty.createIndex({ "a": 1 })
db.texty.find({ "a": { "$lt": "c" } }).explain()

Then the explain output:

            "winningPlan" : {
                    "stage" : "FETCH",
                    "inputStage" : {
                            "stage" : "IXSCAN",
                            "keyPattern" : {
                                    "a" : 1
                            },
                            "indexName" : "a_1",
                            "isMultiKey" : false,
                            "isUnique" : false,
                            "isSparse" : false,
                            "isPartial" : false,
                            "indexVersion" : 1,
                            "direction" : "forward",
                            "indexBounds" : {
                                    "a" : [
                                            "[\"\", \"c\")"
                                    ]
                            }
                    }
            },

Then that at least got the indexBounds to consider and only looked at that portion of the index that fell within those bounds.

So that's the differences here. Using a "compound" structure should save you some iteration cycles here by being able to narrow down the selection, but it still must scan all index entries to filter, and must of course not be the "prefix" element in the index unless you can use an equality match on it.

Without a compound structure in the index, you are always returning the text results "first", and then applying any other conditions to those results. Also it is not possible to "combine/intersect" the results from looking at a "text" index and a "normal" index due to the query engine handling. That is generally not going to be the optimal approach, so planning for considerations is important.

In short, ideally compound with an "equality" match "prefix", and if not then include in the index "after" the text definition.