Random record from MongoDB

Posted 2018-12-31 04:39

I am looking to get a random record from a huge (100 million record) MongoDB collection.

What is the fastest and most efficient way to do so? The data is already there, and there is no field that I could use to generate a random number and obtain a random row.

Any suggestions?

Tags: mongodb
25 answers
情到深处是孤独
#2 · 2018-12-31 05:16

My PHP/MongoDB solution for sorting in random order. Hope this helps someone.

Note: I have numeric IDs within my MongoDB collection that refer to a MySQL database record.

First I create an array with 10 randomly generated numbers:

    $randomNumbers = [];
    for($i = 0; $i < 10; $i++){
        $randomNumbers[] = rand(0,1000);
    }

In my aggregation I use the $addFields pipeline stage combined with $arrayElemAt and $mod (modulus). The modulus operator gives me a number from 0 to 9, which I then use to pick a number from the array of randomly generated numbers.

    $aggregate[] = [
        '$addFields' => [
            'random_sort' => [ '$arrayElemAt' => [ $randomNumbers, [ '$mod' => [ '$my_numeric_mysql_id', 10 ] ] ] ],
        ],
    ];

After that you can add a $sort stage to the pipeline:

    $aggregate[] = [
        '$sort' => [
            'random_sort' => 1
        ]
    ];
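
To see what ordering this pipeline produces, the same lookup can be mimicked in plain Python without a database. This is only a sketch: random_sort_key and the sample IDs are hypothetical names made up for illustration. It also shows that documents fall into at most 10 groups, because IDs with the same last digit always receive the same sort key.

```python
import random

def random_sort_key(numeric_id, random_numbers):
    """Mimic $arrayElemAt: [randomNumbers, {$mod: [id, 10]}] for one document."""
    return random_numbers[numeric_id % len(random_numbers)]

# Ten random numbers, as in the PHP snippet above (rand(0, 1000) is inclusive).
random_numbers = [random.randint(0, 1000) for _ in range(10)]

# Hypothetical numeric IDs standing in for my_numeric_mysql_id.
ids = list(range(1, 31))
shuffled = sorted(ids, key=lambda i: random_sort_key(i, random_numbers))
print(shuffled)
```

The order changes from run to run, but it is a coarse shuffle rather than a uniformly random one.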
皆成旧梦
#3 · 2018-12-31 05:16

What works efficiently and reliably is this:

Add a field called "random" to each document and assign a random value to it, add an index on that field, and proceed as follows:

Let's assume we have a collection of web links called "links" and we want a random link from it:

link = db.links.find().sort({random: 1}).limit(1)[0]

To ensure the same link won't pop up a second time, update its random field with a new random number:

db.links.update({_id: link._id}, {$set: {random: Math.random()}})
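
The same two steps could look roughly like this from Python. This is a sketch, not a tested production routine: pick_and_reshuffle is a hypothetical helper name, and it assumes a pymongo collection whose documents already carry an indexed numeric "random" field.

```python
import random

def pick_and_reshuffle(collection):
    """Return the document with the smallest "random" value, then re-roll that value.

    Assumption: `collection` is a pymongo Collection (or anything offering
    find_one and update_one with the same signatures).
    """
    doc = collection.find_one({}, sort=[("random", 1)])
    if doc is None:
        return None  # empty collection
    # Give the chosen document a fresh random value so it is unlikely
    # to be first in the sort order on the next call.
    collection.update_one(
        {"_id": doc["_id"]},
        {"$set": {"random": random.random()}},
    )
    return doc
```

Each call costs one indexed read and one single-document update, so it should stay fast even on a large collection.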
ら面具成の殇う
#4 · 2018-12-31 05:18

Using Python (pymongo), the aggregate function also works.

collection.aggregate([{'$sample': {'size': sample_size }}])

This approach is a lot faster than running a query for a random number (e.g. collection.find([random_int])), especially for large collections.
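
If the random documents should come from a filtered subset, a $match stage can be placed in front of $sample. A minimal sketch follows; sample_pipeline is a hypothetical helper and the "status" field in the usage comment is an invented example, not something from the original collection.

```python
def sample_pipeline(sample_size, match=None):
    """Build an aggregation pipeline that returns `sample_size` random documents."""
    if sample_size < 1:
        raise ValueError("sample_size must be a positive integer")
    pipeline = []
    if match:
        # Optional filter applied before sampling.
        pipeline.append({"$match": match})
    pipeline.append({"$sample": {"size": sample_size}})
    return pipeline

# Usage with a pymongo collection (not executed here):
#   docs = list(collection.aggregate(sample_pipeline(5, {"status": "active"})))
print(sample_pipeline(1))
```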

孤独寂梦人
#5 · 2018-12-31 05:20

Starting with the 3.2 release of MongoDB, you can get N random docs from a collection using the $sample aggregation pipeline operator:

// Get one random document from the mycoll collection.
db.mycoll.aggregate([{ $sample: { size: 1 } }])
怪性笑人.
#6 · 2018-12-31 05:22

None of the solutions worked well for me, especially when there are many gaps and the set is small. This worked very well for me (in PHP):

$count = $collection->count($search);
$skip = mt_rand(0, $count - 1);
$result = $collection->find($search)->skip($skip)->limit(1)->getNext();
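
For comparison, a rough Python equivalent of the count-and-skip idea is sketched below. random_document is a hypothetical helper name, and the code assumes a pymongo collection exposing count_documents and find. Note that skip has to walk past the skipped documents on the server, so this is likely to be slow for large offsets on a collection with 100 million records.

```python
import random

def random_document(collection, query=None):
    """Pick one matching document by skipping a random number of results."""
    query = query or {}
    count = collection.count_documents(query)
    if count == 0:
        return None  # nothing matches, so there is nothing to pick
    skip = random.randrange(count)  # 0 .. count - 1, like mt_rand(0, $count - 1)
    cursor = collection.find(query).skip(skip).limit(1)
    return next(iter(cursor), None)
```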
墨雨无痕
#7 · 2018-12-31 05:23

Here is a way using the default ObjectId values for _id and a little math and logic.

// Get the "min" and "max" timestamp values from the _id in the collection and the 
// diff between.
// 4-bytes from a hex string is 8 characters

var min = parseInt(db.collection.find()
        .sort({ "_id": 1 }).limit(1).toArray()[0]._id.str.substr(0,8),16)*1000,
    max = parseInt(db.collection.find()
        .sort({ "_id": -1 }).limit(1).toArray()[0]._id.str.substr(0,8),16)*1000,
    diff = max - min;

// Get a random value from diff and divide/multiply by 1000 for the "_id" precision:
var random = Math.floor(Math.floor(Math.random()*diff)/1000)*1000;

// Use "random" in the range and pad the hex string to a valid ObjectId
var _id = new ObjectId(((min + random)/1000).toString(16) + "0000000000000000")

// Then query for the single document:
var randomDoc = db.collection.find({ "_id": { "$gte": _id } })
   .sort({ "_id": 1 }).limit(1).toArray()[0];

That's the general logic in shell syntax, and it is easily adaptable.

In summary:

  • Find the min and max primary key values in the collection

  • Generate a random number that falls between the timestamps of those documents.

  • Add the random number to the minimum value and find the first document that is greater than or equal to that value.

This pads the timestamp value in hex to form a valid ObjectId value, since that is what we are querying on. Using integers as the _id value is simpler, but the basic idea in the points above is the same.
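
The padding step from the points above can be sketched in Python as follows. objectid_hex_between is a hypothetical helper, and the two sample _id strings are made up for illustration. The sketch works in whole seconds directly, so the multiply/divide by 1000 from the shell version is not needed. Because the pick is uniform over time rather than over documents, collections with bursty inserts will probably not be sampled evenly.

```python
import random

def objectid_hex_between(min_id_hex, max_id_hex, rng=random):
    """Return a 24-char hex string usable as an ObjectId lower bound.

    The first 8 hex characters of an ObjectId are its creation time in
    seconds; the remaining 16 characters are zero-padded here.
    """
    min_ts = int(min_id_hex[:8], 16)
    max_ts = int(max_id_hex[:8], 16)
    ts = rng.randint(min_ts, max_ts)  # inclusive on both ends
    return format(ts, "08x") + "0" * 16

# Hypothetical smallest and largest _id values from a collection.
lower_bound = objectid_hex_between(
    "5c29a1f0a1b2c3d4e5f60718", "5c2a0000a1b2c3d4e5f60718"
)
print(lower_bound)
# With pymongo this could then be used as (not executed here):
#   collection.find_one({"_id": {"$gte": ObjectId(lower_bound)}}, sort=[("_id", 1)])
```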
