Random record from MongoDB

2018-12-31 04:39发布

I am looking to get a random record from a huge (100 million record) mongodb.

What is the fastest and most efficient way to do so? The data is already there and there are no field in which I can generate a random number and obtain a random row.

Any suggestions?

标签: mongodb
25条回答
孤独总比滥情好
2楼-- · 2018-12-31 04:57

In Python using pymongo:

import random

def get_random_doc():
    count = collection.count()
    return collection.find()[random.randrange(count)]
查看更多
零度萤火
3楼-- · 2018-12-31 04:58

You can also use MongoDB's geospatial indexing feature to select the documents 'nearest' to a random number.

First, enable geospatial indexing on a collection:

db.docs.ensureIndex( { random_point: '2d' } )

To create a bunch of documents with random points on the X-axis:

for ( i = 0; i < 10; ++i ) {
    db.docs.insert( { key: i, random_point: [Math.random(), 0] } );
}

Then you can get a random document from the collection like this:

db.docs.findOne( { random_point : { $near : [Math.random(), 0] } } )

Or you can retrieve several document nearest to a random point:

db.docs.find( { random_point : { $near : [Math.random(), 0] } } ).limit( 4 )

This requires only one query and no null checks, plus the code is clean, simple and flexible. You could even use the Y-axis of the geopoint to add a second randomness dimension to your query.

查看更多
泪湿衣
4楼-- · 2018-12-31 04:58

The following recipe is a little slower than the mongo cookbook solution (add a random key on every document), but returns more evenly distributed random documents. It's a little less-evenly distributed than the skip( random ) solution, but much faster and more fail-safe in case documents are removed.

function draw(collection, query) {
    // query: mongodb query object (optional)
    var query = query || { };
    query['random'] = { $lte: Math.random() };
    var cur = collection.find(query).sort({ rand: -1 });
    if (! cur.hasNext()) {
        delete query.random;
        cur = collection.find(query).sort({ rand: -1 });
    }
    var doc = cur.next();
    doc.random = Math.random();
    collection.update({ _id: doc._id }, doc);
    return doc;
}

It also requires you to add a random "random" field to your documents so don't forget to add this when you create them : you may need to initialize your collection as shown by Geoffrey

function addRandom(collection) { 
    collection.find().forEach(function (obj) {
        obj.random = Math.random();
        collection.save(obj);
    }); 
} 
db.eval(addRandom, db.things);

Benchmark results

This method is much faster than the skip() method (of ceejayoz) and generates more uniformly random documents than the "cookbook" method reported by Michael:

For a collection with 1,000,000 elements:

  • This method takes less than a millisecond on my machine

  • the skip() method takes 180 ms on average

The cookbook method will cause large numbers of documents to never get picked because their random number does not favor them.

  • This method will pick all elements evenly over time.

  • In my benchmark it was only 30% slower than the cookbook method.

  • the randomness is not 100% perfect but it is very good (and it can be improved if necessary)

This recipe is not perfect - the perfect solution would be a built-in feature as others have noted.
However it should be a good compromise for many purposes.

查看更多
浪荡孟婆
5楼-- · 2018-12-31 04:59

In order to get a determinated number of random docs without duplicates:

  1. first get all ids
  2. get size of documents
  3. loop geting random index and skip duplicated

    number_of_docs=7
    db.collection('preguntas').find({},{_id:1}).toArray(function(err, arr) {
    count=arr.length
    idsram=[]
    rans=[]
    while(number_of_docs!=0){
        var R = Math.floor(Math.random() * count);
        if (rans.indexOf(R) > -1) {
         continue
          } else {           
                   ans.push(R)
                   idsram.push(arr[R]._id)
                   number_of_docs--
                    }
        }
    db.collection('preguntas').find({}).toArray(function(err1, doc1) {
                    if (err1) { console.log(err1); return;  }
                   res.send(doc1)
                });
            });
    
查看更多
姐姐魅力值爆表
6楼-- · 2018-12-31 04:59

I would suggest using map/reduce, where you use the map function to only emit when a random value is above a given probability.

function mapf() {
    if(Math.random() <= probability) {
    emit(1, this);
    }
}

function reducef(key,values) {
    return {"documents": values};
}

res = db.questions.mapReduce(mapf, reducef, {"out": {"inline": 1}, "scope": { "probability": 0.5}});
printjson(res.results);

The reducef function above works because only one key ('1') is emitted from the map function.

The value of the "probability" is defined in the "scope", when invoking mapRreduce(...)

Using mapReduce like this should also be usable on a sharded db.

If you want to select exactly n of m documents from the db, you could do it like this:

function mapf() {
    if(countSubset == 0) return;
    var prob = countSubset / countTotal;
    if(Math.random() <= prob) {
        emit(1, {"documents": [this]}); 
        countSubset--;
    }
    countTotal--;
}

function reducef(key,values) {
    var newArray = new Array();
for(var i=0; i < values.length; i++) {
    newArray = newArray.concat(values[i].documents);
}

return {"documents": newArray};
}

res = db.questions.mapReduce(mapf, reducef, {"out": {"inline": 1}, "scope": {"countTotal": 4, "countSubset": 2}})
printjson(res.results);

Where "countTotal" (m) is the number of documents in the db, and "countSubset" (n) is the number of documents to retrieve.

This approach might give some problems on sharded databases.

查看更多
永恒的永恒
7楼-- · 2018-12-31 05:01

I'd suggest adding a random int field to each object. Then you can just do a

findOne({random_field: {$gte: rand()}}) 

to pick a random document. Just make sure you ensureIndex({random_field:1})

查看更多
登录 后发表回答