I am looking to get a random record from a huge (100 million record) MongoDB collection.
What is the fastest and most efficient way to do so? The data is already there, and there is no field from which I can generate a random number to obtain a random row.
Any suggestions?
In Python using pymongo:
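The code block for this answer is not shown in this excerpt; a minimal sketch of the usual count-and-skip approach with pymongo (the connection string and the collection name mycollection are assumptions) would be:

```python
import random

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # hypothetical connection
collection = client.mydb.mycollection              # hypothetical collection

def get_random_doc():
    # estimated_document_count() is cheap; count_documents({}) is exact but slower.
    count = collection.estimated_document_count()
    # skip() still walks the offset server-side, so this is O(n) in the skip value.
    return next(collection.find().skip(random.randrange(count)).limit(1), None)
```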
You can also use MongoDB's geospatial indexing feature to select the documents 'nearest' to a random number.
First, enable geospatial indexing on a collection:
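The original commands are not reproduced here; in pymongo, creating a "2d" index on a hypothetical docs collection might look like this:

```python
import random

from pymongo import MongoClient, GEO2D

client = MongoClient()   # assumes a local MongoDB instance
db = client.mydb         # hypothetical database name

# A "2d" index lets us run $near queries against the random_point field.
db.docs.create_index([("random_point", GEO2D)])
```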
To create a bunch of documents with random points on the X-axis:
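Continuing the sketch above, seeding the collection with random X coordinates (Y fixed at 0) could be done like this:

```python
# Each document gets a random point on the X-axis; the Y coordinate stays 0.
db.docs.insert_many(
    [{"key": i, "random_point": [random.random(), 0]} for i in range(1000)]
)
```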
Then you can get a random document from the collection like this:
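For example, a single random document is simply the one nearest to a random X value:

```python
# $near on the 2d index returns documents ordered by distance to the query point.
doc = db.docs.find_one({"random_point": {"$near": [random.random(), 0]}})
```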
Or you can retrieve several documents nearest to a random point:
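A sketch of that variant (the limit of 4 here is arbitrary):

```python
docs = list(
    db.docs.find({"random_point": {"$near": [random.random(), 0]}}).limit(4)
)
```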
This requires only one query and no null checks, plus the code is clean, simple and flexible. You could even use the Y-axis of the geopoint to add a second randomness dimension to your query.
The following recipe is a little slower than the mongo cookbook solution (add a random key on every document), but returns more evenly distributed random documents. It's a little less evenly distributed than the skip(random) solution, but much faster and more fail-safe in case documents are removed. It also requires you to add a random "random" field to your documents, so don't forget to add this when you create them: you may need to initialize your collection as shown by Geoffrey.
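The recipe's code is not included in this excerpt; a pymongo sketch that matches the description (query a stored "random" field, fall back when nothing matches, plus a one-off initialization pass; all names are hypothetical) might look like:

```python
import random

def draw(collection, query=None):
    """Return one pseudo-random document matching `query`."""
    query = dict(query or {})
    query["random"] = {"$lte": random.random()}
    doc = collection.find_one(query, sort=[("random", -1)])
    if doc is None:
        # Fail-safe: nothing below the drawn value, so take the largest "random" instead.
        del query["random"]
        doc = collection.find_one(query, sort=[("random", -1)])
    return doc

def add_random(collection):
    """One-off initialization: give existing documents a "random" field."""
    for doc in collection.find({"random": {"$exists": False}}, {"_id": 1}):
        collection.update_one(
            {"_id": doc["_id"]}, {"$set": {"random": random.random()}}
        )
```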
Benchmark results
This method is much faster than the skip() method (of ceejayoz) and generates more uniformly random documents than the "cookbook" method reported by Michael. For a collection with 1,000,000 elements:
This method takes less than a millisecond on my machine, while the skip() method takes 180 ms on average. The cookbook method will cause large numbers of documents never to get picked because their random number does not favor them; this method will pick all elements evenly over time. In my benchmark it was only 30% slower than the cookbook method. The randomness is not 100% perfect, but it is very good (and it can be improved if necessary).
This recipe is not perfect - the perfect solution would be a built-in feature as others have noted.
However it should be a good compromise for many purposes.
In order to get a determined number of random docs without duplicates: loop, getting a random index each time and skipping duplicates.
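A sketch of that loop with pymongo (names are hypothetical; note that skip-based access remains slow on very large collections):

```python
import random

def get_random_docs(collection, n):
    count = collection.estimated_document_count()
    seen_indexes = set()
    docs = []
    while len(docs) < min(n, count):
        idx = random.randrange(count)
        if idx in seen_indexes:
            continue  # skip duplicated indexes
        seen_indexes.add(idx)
        # skip/limit access to the idx-th document; None can happen if docs were removed.
        doc = collection.find_one(skip=idx)
        if doc is not None:
            docs.append(doc)
    return docs
```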
I would suggest using map/reduce, where you use the map function to only emit when a random value is above a given probability.
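The answer's code is not included in this excerpt; a sketch of that idea driven from pymongo (the map and reduce functions are JavaScript regardless of driver; the database, collection name, and probability value are assumptions, and classic mapReduce is deprecated on recent MongoDB versions) might be:

```python
from bson.code import Code
from pymongo import MongoClient

client = MongoClient()   # assumes a local instance
db = client.mydb         # hypothetical database

# Emit a document only when a random draw falls below "probability".
mapf = Code("""
function () {
    if (Math.random() <= probability) {
        emit(1, this);
    }
}
""")

# All emitted values share the key 1, so they are collected into one list.
reducef = Code("""
function (key, values) {
    return { documents: values };
}
""")

res = db.command({
    "mapReduce": "questions",         # hypothetical collection name
    "map": mapf,
    "reduce": reducef,
    "out": {"inline": 1},
    "scope": {"probability": 0.01},   # injected into the JS scope of mapf
})
print(res["results"])
```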
The reducef function above works because only one key ('1') is emitted from the map function.
The value of "probability" is defined in the "scope" when invoking mapReduce(...).
Using mapReduce like this should also be usable on a sharded db.
If you want to select exactly n of m documents from the db, you could do it like this:
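Following the same pattern, a sketch of the exactly-n variant (again with hypothetical names and values, reusing the reduce function from above):

```python
count_total = db.questions.estimated_document_count()   # m
count_subset = 10                                        # n, chosen arbitrarily here

# Decrement the counters in the JS scope as documents stream through the map function.
mapf = Code("""
function () {
    if (countSubset == 0) return;
    var prob = countSubset / countTotal;
    if (Math.random() <= prob) {
        emit(1, this);
        countSubset--;
    }
    countTotal--;
}
""")

res = db.command({
    "mapReduce": "questions",
    "map": mapf,
    "reduce": reducef,   # same reduce function as in the sketch above
    "out": {"inline": 1},
    "scope": {"countTotal": count_total, "countSubset": count_subset},
})
```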
Where "countTotal" (m) is the number of documents in the db, and "countSubset" (n) is the number of documents to retrieve.
This approach might give some problems on sharded databases.
I'd suggest adding a random int field to each object. Then you can just query on that field with a freshly generated random value to pick a random document. Just make sure you ensureIndex({random_field:1}).
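A sketch of that approach with pymongo (the field name, range, and fallback query are assumptions; the stored value is a random int as suggested):

```python
import random

from pymongo import MongoClient

client = MongoClient()                  # assumes a local instance
collection = client.mydb.mycollection   # hypothetical collection

# One-time: index the random field so the range lookup below stays fast.
collection.create_index([("random_field", 1)])

MAX_RAND = 2**31 - 1
r = random.randint(0, MAX_RAND)
doc = collection.find_one({"random_field": {"$gte": r}})
if doc is None:
    # Wrap around in case r landed above every stored value.
    doc = collection.find_one({"random_field": {"$lte": r}})
```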