Random record from MongoDB

I am looking to get a random record from a huge (100 million record) mongodb.

What is the fastest and most efficient way to do so? The data is already there and there are no field in which I can generate a random number and obtain a random row.

Any suggestions?

标签： mongodb

25条回答

孤独总比滥情好

2楼-- · 2018-12-31 04:57

In Python using pymongo:

import random

def get_random_doc():
    count = collection.count()
    return collection.find()[random.randrange(count)]

0人赞添加讨论(0) 举报

零度萤火

3楼-- · 2018-12-31 04:58

You can also use MongoDB's geospatial indexing feature to select the documents 'nearest' to a random number.

First, enable geospatial indexing on a collection:

db.docs.ensureIndex( { random_point: '2d' } )

To create a bunch of documents with random points on the X-axis:

for ( i = 0; i < 10; ++i ) {
    db.docs.insert( { key: i, random_point: [Math.random(), 0] } );
}

Then you can get a random document from the collection like this:

db.docs.findOne( { random_point : { $near : [Math.random(), 0] } } )

Or you can retrieve several document nearest to a random point:

db.docs.find( { random_point : { $near : [Math.random(), 0] } } ).limit( 4 )

This requires only one query and no null checks, plus the code is clean, simple and flexible. You could even use the Y-axis of the geopoint to add a second randomness dimension to your query.

0人赞添加讨论(0) 举报

泪湿衣

4楼-- · 2018-12-31 04:58

The following recipe is a little slower than the mongo cookbook solution (add a random key on every document), but returns more evenly distributed random documents. It's a little less-evenly distributed than the skip( random ) solution, but much faster and more fail-safe in case documents are removed.

function draw(collection, query) {
    // query: mongodb query object (optional)
    var query = query || { };
    query['random'] = { $lte: Math.random() };
    var cur = collection.find(query).sort({ rand: -1 });
    if (! cur.hasNext()) {
        delete query.random;
        cur = collection.find(query).sort({ rand: -1 });
    }
    var doc = cur.next();
    doc.random = Math.random();
    collection.update({ _id: doc._id }, doc);
    return doc;
}

It also requires you to add a random "random" field to your documents so don't forget to add this when you create them : you may need to initialize your collection as shown by Geoffrey

function addRandom(collection) { 
    collection.find().forEach(function (obj) {
        obj.random = Math.random();
        collection.save(obj);
    }); 
} 
db.eval(addRandom, db.things);

Benchmark results

This method is much faster than the skip() method (of ceejayoz) and generates more uniformly random documents than the "cookbook" method reported by Michael:

For a collection with 1,000,000 elements:

This method takes less than a millisecond on my machine
the skip() method takes 180 ms on average

The cookbook method will cause large numbers of documents to never get picked because their random number does not favor them.

This method will pick all elements evenly over time.
In my benchmark it was only 30% slower than the cookbook method.
the randomness is not 100% perfect but it is very good (and it can be improved if necessary)

This recipe is not perfect - the perfect solution would be a built-in feature as others have noted.
However it should be a good compromise for many purposes.

0人赞添加讨论(0) 举报

浪荡孟婆

5楼-- · 2018-12-31 04:59

In order to get a determinated number of random docs without duplicates:

first get all ids
get size of documents

loop geting random index and skip duplicated

number_of_docs=7
db.collection('preguntas').find({},{_id:1}).toArray(function(err, arr) {
count=arr.length
idsram=[]
rans=[]
while(number_of_docs!=0){
    var R = Math.floor(Math.random() * count);
    if (rans.indexOf(R) > -1) {
     continue
      } else {           
               ans.push(R)
               idsram.push(arr[R]._id)
               number_of_docs--
                }
    }
db.collection('preguntas').find({}).toArray(function(err1, doc1) {
                if (err1) { console.log(err1); return;  }
               res.send(doc1)
            });
        });

0人赞添加讨论(0) 举报

姐姐魅力值爆表

6楼-- · 2018-12-31 04:59

I would suggest using map/reduce, where you use the map function to only emit when a random value is above a given probability.

function mapf() {
    if(Math.random() <= probability) {
    emit(1, this);
    }
}

function reducef(key,values) {
    return {"documents": values};
}

res = db.questions.mapReduce(mapf, reducef, {"out": {"inline": 1}, "scope": { "probability": 0.5}});
printjson(res.results);

The reducef function above works because only one key ('1') is emitted from the map function.

The value of the "probability" is defined in the "scope", when invoking mapRreduce(...)

Using mapReduce like this should also be usable on a sharded db.

If you want to select exactly n of m documents from the db, you could do it like this:

function mapf() {
    if(countSubset == 0) return;
    var prob = countSubset / countTotal;
    if(Math.random() <= prob) {
        emit(1, {"documents": [this]}); 
        countSubset--;
    }
    countTotal--;
}

function reducef(key,values) {
    var newArray = new Array();
for(var i=0; i < values.length; i++) {
    newArray = newArray.concat(values[i].documents);
}

return {"documents": newArray};
}

res = db.questions.mapReduce(mapf, reducef, {"out": {"inline": 1}, "scope": {"countTotal": 4, "countSubset": 2}})
printjson(res.results);

Where "countTotal" (m) is the number of documents in the db, and "countSubset" (n) is the number of documents to retrieve.

This approach might give some problems on sharded databases.

0人赞添加讨论(0) 举报

永恒的永恒

7楼-- · 2018-12-31 05:01

I'd suggest adding a random int field to each object. Then you can just do a

findOne({random_field: {$gte: rand()}})

to pick a random document. Just make sure you ensureIndex({random_field:1})

0人赞添加讨论(0) 举报

1 2 3 4 5 下一页

Random record from MongoDB

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间