How to optimize MongoDB query with both $gt and $l

2020-06-03 03:01发布

问题:

I have the following query that is kind of like a reverse range lookup:

db.ip_ranges.find({ $and: [{ start_ip_num: { $lte: 1204135028 } }, { end_ip_num: { $gt: 1204135028 } }] })

When run with only the $lte identifier, the query returns right away. But when I run with both the $gt and $lte in the same query, it is extremely slow (in seconds).

Both the start_ip_num and end_ip_num fields are indexed.

How can I go about optimizing this query?

EDIT

I get the following when I use the explain() function on the query:

{
    "cursor" : "BtreeCursor start_ip_num_1",
    "nscanned" : 452336,
    "nscannedObjects" : 452336,
    "n" : 1,
    "millis" : 2218,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "isMultiKey" : false,
    "indexOnly" : false,
    "indexBounds" : {
        "start_ip_num" : [
            [
                -1.7976931348623157e+308,
                1204135028
            ]
        ]
    }
}

EDIT 2

Once I added the compound index, the explain() function returns the following:

{
    "cursor" : "BtreeCursor start_ip_num_1_end_ip_num_1",
    "nscanned" : 431776,
    "nscannedObjects" : 1,
    "n" : 1,
    "millis" : 3433,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "isMultiKey" : false,
    "indexOnly" : false,
    "indexBounds" : {
        "start_ip_num" : [
            [
                -1.7976931348623157e+308,
                1204135028
            ]
        ],
        "end_ip_num" : [
            [
                1204135028,
                1.7976931348623157e+308
            ]
        ]
    }
}

However, the perf is still poor (in seconds).

回答1:

So, double range queries are ill-advised in Mongo. I assume you have a single index containing both {start_ip_num: 1, end_ip_num: 1}.

If that doesn't get you close enough (often it is still slow if you have enough data returned by the first field, since it has to do a lot of B-tree scanning), there is one trick you can do to combat this using 2D box queries (only works for two ranges at a time).

Basically, you put a 2D geo index on a field containing the two points in an array, like [start_ip, end_ip], and give it a high enough min/max value so that it won't hit the limits which are by default just -180/180.

Finally, use a bounds query with the range going from min to the $lte value on one corner of the box, and the gt and the max value on the other corner of the box. See http://www.mongodb.org/display/DOCS/Geospatial+Indexing#GeospatialIndexing-BoundsQueries for syntax.

It looks something like this:

db.ip_ranges.find({ip_range:{$within:{$box:[[0, 1204135028], [1204135028, max]]}}});

where max is the biggest ip you can have.

It's been a while since I've looked a this, so the box might be wrong, but the concept is sound, and it made the double range queries perform a bit better than with a regular two field B-tree index. Consistently under a second (although usually a few hundred ms), compared to a few seconds with the regular index--I think I had hundreds of millions of docs at the time, but it's been a while so take these remembered benchmarks with a grain of salt. Results will vary greatly depending on your data and range sizes, I'm sure.

Update: You might want to experiment with the bits setting, trying a low number and a high number to see if it makes a difference. For me, it didn't seem to affect the queries on average. See http://www.mongodb.org/display/DOCS/Geospatial+Indexing#GeospatialIndexing-CreatingtheIndex for syntax.



回答2:

According to the Ip2location website, one can achieve fast queries on ip addresses with mongodb without the range query. Create just one index on mongodb { ip_to: 1 }, and query the ip with:

db.collection_name.find({ ip_to: { $gte : ip_integer } }).sort({ ip_end: 1 }).limit(1)

With this config I got 1ms query time with a 6 million document collection.



回答3:

The trick is to use a $lte and a sort. I got the query down to a few ms.

I had the exact same problem - finding which CIDR block matched a particular IP address. I also tried using $gte and $lte and was getting 10 second response times.

I solved the problem in a different way. Note that the CIDR blocks (the IP address ranges) in the MaxMind database don't overlap. Each IP address will at most match one result. So all you need to do is find the CIDR block with the largest start_ip_num that is less than the particular IP address. Then verify in application code that the end_ip_num is greater than the particular IP address.

Here's the code (using the node MongoDB client):

// Convert IP address to base 10.
var ipToDecimal = function (ipAddress) {
  var split = ipAddress.split('.');
  return (split[0] * 16777216) + (split[1] * 65536) + (split[2] * 256) + (+split[3]);
};

var ipAddress = '1.2.3.4';
var ipDecimal = ipToDecimal(ipAddress);

db.ip_addresses.find({start_ip_num: {$lte: ipDecimal}}, {_id: 0, country_code: 1, end_ip_num: 1}, {limit: 1, sort: [['start_ip_num', -1]]}).toArray(function (error, ipAddresses) {
  if (ipAddresses[0] && ipAddresses[0]['end_ip_num'] >= ipDecimal) {
    console.log('IP address found: ', ipAddresses[0]['country_code']);
  } else {
    console.log('IP address not found.');
  }
});

Be sure to create an index on start_ip_num.



回答4:

After a ton of experimenting and researching, I came across this:

https://groups.google.com/forum/?fromgroups=#!topic/mongodb-user/IUwOzWsc0Sg

I'm able to get the query down around 200-300ms with this query, AND dropping all indexes (You have to drop all the indexes for this to work!!!):

db.ip_ranges.find({start_ip_num: {$lte: 1204135028}, end_ip_num: {$gt: 1204135028}}).limit(1)

Don't ask me why. I can't explain it. If you're interested, I was building the GeoIP database from MaxMind with MongoDB.