Efficiently sorting the results of a mongodb geosp

2020-07-18 09:14发布

问题:

I have a very large collection of documents like:

{ loc: [10.32, 24.34], relevance: 0.434 }

and want to be able efficiently do a query like:

 { "loc": {"$geoWithin":{"$box":[[-103,10.1],[-80.43,30.232]]}} }

with arbitrary boxes.

Adding an 2d index on loc makes this very fast and efficient. However, I want to now also just get the most relevant documents:

.sort({ relevance: -1 })

Which causes everything to grind to a crawl (there can be huge amount of results in any particular box, and I just need the top 10 or so).

Any advise or help greatly appreciated!!

回答1:

Have you tried using the aggregation framework?

A two stage pipeline might work:

  1. a $match stage that uses your existing $geoWithin query.
  2. a $sort stage that sorts by relevance: -1

Here's an example of what it might look like:

db.foo.aggregate(
    {$match: { "loc": {"$geoWithin":{"$box":[[-103,10.1],[-80.43,30.232]]}} }},
    {$sort: {relevance: -1}}
);

I'm not sure how it will perform. However, even if it's poor with MongoDB 2.4, it might be dramatically different in 2.6/2.5, as 2.6 will include improved aggregation sort performance.



回答2:

When there is a huge result matching particular box, sort operation is really expensive so that you definitely want to avoid it. Try creating separate index on relevance field and try using it (without 2d index at all): the query will be executed much more efficiently that way - documents (already sorted by relevance) will be scanned one by one matching the given geo box condition. When top 10 are found, you're good.

It might not be that fast if geo box matches only small subset of the collection, though. In worst case scenario it will need to scan through the whole collection.

I suggest you to create 2 indexes (loc vs. relevance) and run tests on queries which are common in your app (using mongo's hint to force using needed index).

Depending on your tests results, you may even want to add some app logic so that if you know the box is huge you can run the query with relevance index, otherwise use loc 2d index. Just a thought.



回答3:

You cannot have the scan and order value as 0 when you trying to use to have sorting on the part of a compound key. Unfortunately currently there is no solution for your problem which is not related to the phenomenon that you are using a 2d index or else.

When you run an explain command on your query the value of "scanAndOrder" show weather it was needed to have a sorting phase after collecting the result or not.If it is true a sorting after the querying was needed, if it is false sorting was not needed.

To test the situation i created a collection called t2 in a sample db this way:

db.createCollection('t2')
db.t2.ensureIndex({a:1})
db.t2.ensureIndex({b:1})
db.t2.ensureIndex({a:1,b:1})
db.t2.ensureIndex({b:1,a:1})

for(var i=0;i++<200;){db.t2.insert({a:i,b:i+2})}

While you can use only 1 index to support a query i did the following test with the results included:

mongos> db.t2.find({a:{$gt:50}}).sort({b:1}).hint("b_1").explain()
{
    "cursor" : "BtreeCursor b_1",
    "isMultiKey" : false,
    "n" : 150,
    "nscannedObjects" : 200,
    "nscanned" : 200,
    "nscannedObjectsAllPlans" : 200,
    "nscannedAllPlans" : 200,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "millis" : 0,
    "indexBounds" : {
        "b" : [
            [
                {
                    "$minElement" : 1
                },
                {
                    "$maxElement" : 1
                }
            ]
        ]
    },
    "server" : "localhost:27418",
    "millis" : 0
}
mongos> db.t2.find({a:{$gt:50}}).sort({b:1}).hint("a_1_b_1").explain()
{
    "cursor" : "BtreeCursor a_1_b_1",
    "isMultiKey" : false,
    "n" : 150,
    "nscannedObjects" : 150,
    "nscanned" : 150,
    "nscannedObjectsAllPlans" : 150,
    "nscannedAllPlans" : 150,
    "scanAndOrder" : true,
    "indexOnly" : false,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "millis" : 1,
    "indexBounds" : {
        "a" : [
            [
                50,
                1.7976931348623157e+308
            ]
        ],
        "b" : [
            [
                {
                    "$minElement" : 1
                },
                {
                    "$maxElement" : 1
                }
            ]
        ]
    },
    "server" : "localhost:27418",
    "millis" : 1
}
mongos> db.t2.find({a:{$gt:50}}).sort({b:1}).hint("a_1").explain()
{
    "cursor" : "BtreeCursor a_1",
    "isMultiKey" : false,
    "n" : 150,
    "nscannedObjects" : 150,
    "nscanned" : 150,
    "nscannedObjectsAllPlans" : 150,
    "nscannedAllPlans" : 150,
    "scanAndOrder" : true,
    "indexOnly" : false,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "millis" : 1,
    "indexBounds" : {
        "a" : [
            [
                50,
                1.7976931348623157e+308
            ]
        ]
    },
    "server" : "localhost:27418",
    "millis" : 1
}


 mongos> db.t2.find({a:{$gt:50}}).sort({b:1}).hint("b_1_a_1").explain()
{
    "cursor" : "BtreeCursor b_1_a_1",
    "isMultiKey" : false,
    "n" : 150,
    "nscannedObjects" : 150,
    "nscanned" : 198,
    "nscannedObjectsAllPlans" : 150,
    "nscannedAllPlans" : 198,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "millis" : 0,
    "indexBounds" : {
        "b" : [
            [
                {
                    "$minElement" : 1
                },
                {
                    "$maxElement" : 1
                }
            ]
        ],
        "a" : [
            [
                50,
                1.7976931348623157e+308
            ]
        ]
    },
    "server" : "localhost:27418",
    "millis" : 0
}

The indexes on individual fields does not help much so a_1 (not support sorting) and b_1 (not support queryin) is out . The index on a_1_b_1 also not fortunate while it will perform worse than the single a_1, mongoDB engine will not utilize the situation that the part related to one 'a' value stored ordered this way. What is worth to try is a compound index b_1_a_1 which in your case relevance_1_loc_1 while it will return the results in ordered manner so scanAndOrder will be false and i have not tested for 2d index but i assume it will exclude scanning some documents based on just the index value (that is why in the test in that case the nscanned is higher than nscannedObjects). The index unfortunately will be huge but still smaller than the docs.



回答4:

This solution is valid if you need to search inside a box(rectangle).

The problem with geospatial index is that you can only place it in the front of a Compound Index (at least it is so for mongo 3.2)

So I thought why not to create my own "geospatial" index? All I need is to create a Compound Index on Lat, Lgn (X,Y) and add the sort field at the first place. Then I'll need to implement the logic of searching inside the box boundaries and specifically instruct mongo to use it (hint).

Translating to your problem:

db.collection.createIndex({ "relevance": 1, "loc_x": 1, "loc_y": 1 }, { "background": true } )

Logic:

db.collection.find({
    "loc_x": { "$gt": -103, "$lt": -80.43 },
    "loc_y": { "$gt": 10.1, "$lt": 30.232 }
}).hint("relevance_1_loc_x_1_loc_y_1") // or whatever name you gave it

Use $gte and $lte if you need inclusive results.

And you don't need to use .sort() since it's already sorted, or you can do a reverse sort on relevance if you need.

The only issue that I encountered with it is when the box area is small. It took more time to find small areas than large ones. That is why I kept the geospatial index for small area searches.



标签: mongodb