I am looking to get a random record from a huge (100 million record) mongodb
.
What is the fastest and most efficient way to do so? The data is already there and there are no field in which I can generate a random number and obtain a random row.
Any suggestions?
My PHP/MongoDB sort/order by RANDOM solution. Hope this helps anyone.
Note: I have numeric ID's within my MongoDB collection that refer to a MySQL database record.
First I create an array with 10 randomly generated numbers
In my aggregation I use the $addField pipeline operator combined with $arrayElemAt and $mod (modulus). The modulus operator will give me a number from 0 - 9 which I then use to pick a number from the array with random generated numbers.
After that you can use the sort Pipeline.
What works efficiently and reliably is this:
Add a field called "random" to each document and assign a random value to it, add an index for the random field and proceed as follows:
Let's assume we have a collection of web links called "links" and we want a random link from it:
To ensure the same link won't pop up a second time, update its random field with a new random number:
Using Python (pymongo), the aggregate function also works.
This approach is a lot faster than running a query for a random number (e.g. collection.find([random_int]). This is especially the case for large collections.
Starting with the 3.2 release of MongoDB, you can get N random docs from a collection using the
$sample
aggregation pipeline operator:non of the solutions worked well for me. especially when there are many gaps and set is small. this worked very well for me(in php):
Here is a way using the default
ObjectId
values for_id
and a little math and logic.That's the general logic in shell representation and easily adaptable.
So in points:
Find the min and max primary key values in the collection
Generate a random number that falls between the timestamps of those documents.
Add the random number to the minimum value and find the first document that is greater than or equal to that value.
This uses "padding" from the timestamp value in "hex" to form a valid
ObjectId
value since that is what we are looking for. Using integers as the_id
value is essentially simplier but the same basic idea in the points.