Random record from MongoDB

Posted 2018-12-31 04:39

I am looking to get a random record from a huge (100 million record) mongodb.

What is the fastest and most efficient way to do so? The data is already there, and there is no field from which I could generate a random number and obtain a random row.

Any suggestions?

Tags: mongodb
25 answers
其实,你不懂
#2 · 2018-12-31 05:08

When I was faced with a similar problem, I backtracked and found that the business request was actually for some form of rotation of the inventory being presented. In that case there are much better options, and they come from search engines like Solr, not data stores like MongoDB.

In short, given the requirement to "intelligently rotate" content, what we should do instead of picking a random number across all of the documents is include a personal q score modifier. To implement this yourself, assuming a small population of users, you can store a document per user that holds the productId, impression count, click-through count, last seen date, and whatever other factors the business finds meaningful for computing a q score modifier. When retrieving the set to display, you typically request more documents from the data store than the end user asked for, apply the q score modifier, and take the number of records the end user requested. Then randomize that page of results; because it is a tiny set, you can simply reorder the documents in the application layer (in memory).
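
A rough in-memory sketch of that flow, in case it helps. Everything here is an assumption for illustration: the function names (shuffle, qScoreModifier, rotatePage), the stats fields, and especially the scoring formula, which is only a placeholder for whatever the business defines.

```javascript
// Fisher-Yates shuffle; returns a new array and leaves the input untouched.
function shuffle(items, random = Math.random) {
  const out = items.slice();
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(random() * (i + 1));
    [out[i], out[j]] = [out[j], out[i]];
  }
  return out;
}

// Placeholder scoring rule (an assumption, not from the answer):
// reward click-through rate, penalize items that were already shown often.
function qScoreModifier(stats) {
  const impressions = stats ? stats.impressions : 0;
  const clicks = stats ? stats.clicks : 0;
  const ctr = impressions > 0 ? clicks / impressions : 0;
  return ctr - 0.01 * impressions;
}

// candidates: documents over-fetched from the data store (more than pageSize).
// userStats: Map from productId to { impressions, clicks } for one user.
function rotatePage(candidates, userStats, pageSize, random = Math.random) {
  const ranked = candidates
    .map((doc) => ({ doc, score: qScoreModifier(userStats.get(doc.productId)) }))
    .sort((a, b) => b.score - a.score) // best score first
    .slice(0, pageSize)                // keep only what the user asked for
    .map((entry) => entry.doc);
  return shuffle(ranked, random);      // randomize the tiny page in memory
}
```

The shuffle only touches the final page, so its cost does not depend on the size of the collection.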

If the universe of users is too large, you can categorize users into behavior groups and index by behavior group rather than user.

If the universe of products is small enough, you can create an index per user.

I have found this technique to be much more efficient, but more importantly more effective in creating a relevant, worthwhile experience of using the software solution.

看风景的人
#3 · 2018-12-31 05:08

Using Map/Reduce, you can certainly get a random record, just not necessarily very efficiently depending on the size of the resulting filtered collection you end up working with.

I've tested this method with 50,000 documents (the filter reduces it to about 30,000), and it executes in approximately 400ms on an Intel i3 with 16GB ram and a SATA3 HDD...

db.toc_content.mapReduce(
    /* map function */
    function() { emit( 1, this._id ); },

    /* reduce function */
    function(k,v) {
        var r = Math.floor((Math.random()*v.length));
        return v[r];
    },

    /* options */
    {
        out: { inline: 1 },
        /* Filter the collection to "A"ctive documents */
        query: { status: "A" }
    }
);

The map function emits the _id of every document that matches the query under a single key, so they are all collected into one array. In my case I tested this with approximately 30,000 of the 50,000 possible documents.

The reduce function simply picks a random integer between 0 and the number of items in the array minus 1, and then returns the _id at that index.

400ms sounds like a long time, and it really is. If you had fifty million records instead of fifty thousand, the overhead may increase to the point where this becomes unusable in multi-user situations.

There is an open issue for MongoDB to include this feature in the core... https://jira.mongodb.org/browse/SERVER-533

If this "random" selection was built into an index-lookup instead of collecting ids into an array and then selecting one, this would help incredibly. (go vote it up!)
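
For anyone who wants to poke at the selection logic without a server, here is a small in-memory stand-in for the map/reduce call above. It is only a sketch: pickRandomId and its arguments are made-up names, and the filter handles plain equality only.

```javascript
// In-memory stand-in for the map/reduce call above (no MongoDB needed).
// `docs` is a plain array; `query` is a flat object of field: value pairs.
function pickRandomId(docs, query, random = Math.random) {
  // "map": collect the _id of every matching document under one key.
  const emitted = docs
    .filter((doc) => Object.keys(query).every((k) => doc[k] === query[k]))
    .map((doc) => doc._id);
  if (emitted.length === 0) return null;
  // "reduce": choose an index in 0 .. emitted.length - 1.
  const r = Math.floor(random() * emitted.length);
  return emitted[r];
}
```

One caveat worth checking against the MongoDB documentation: the server may call reduce more than once for the same key and feed an earlier result back in as one of the values. When that happens, the pick in the real map/reduce is no longer uniform, which this single-pass sketch does not reproduce.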

其实,你不懂
#4 · 2018-12-31 05:09

You can pick a random _id and return the corresponding object:

    // `db.collection` and the Express-style `res` come from the surrounding application.
    db.collection.distinct("_id", function (err, ids) {
        if (err)
            return res.send(err)
        // Math.random() is in [0, 1), so this index covers 0 .. ids.length - 1.
        var randomId = ids[Math.floor(Math.random() * ids.length)]
        db.collection.findOne({ _id: randomId }, function (err, doc) {
            if (err)
                return res.send(err)
            console.log(doc)
        })
    })

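With this approach you don't need to spend space storing random numbers in the collection. Note that distinct loads every _id into memory, so it is only practical for small collections.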

流年柔荑漫光年
#5 · 2018-12-31 05:12

If you are using Mongoose, you may use the mongoose-random plugin.

看淡一切
#6 · 2018-12-31 05:12

This works nicely: it's fast, works with multiple documents, and doesn't require populating the rand field up front, since it will eventually populate itself:

  1. add index to .rand field on your collection
  2. use find and refresh, something like:
// Install packages:
//   npm install mongodb async
// Add index in mongo:
//   db.mycollection.ensureIndex({ rand: 1 })

var mongodb = require('mongodb')
var async = require('async')

// Find n random documents by using "rand" field.
function findAndRefreshRand (collection, n, fields, done) {
  var result = []
  var rand = Math.random()

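  // Append documents to the result based on criteria and options.
  // The limit is recomputed when the step runs (not when it is built), so the
  // total never exceeds n; if nothing is left to fetch, the call is skipped.
  var appender = function (criteria, options) {
    return function (done) {
      options.limit = n - result.length
      if (options.limit > 0) {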
        collection.find(criteria, fields, options).toArray(
          function (err, docs) {
            if (!err && Array.isArray(docs)) {
              Array.prototype.push.apply(result, docs)
            }
            done(err)
          }
        )
      } else {
        async.nextTick(done)
      }
    }
  }

  async.series([

    // Fetch docs with uninitialized .rand.
    // NOTE: You can comment out this step if all docs already have .rand = Math.random()
    appender({ rand: { $exists: false } }, { limit: n - result.length }),

    // Fetch on one side of random number.
    appender({ rand: { $gte: rand } }, { sort: { rand: 1 }, limit: n - result.length }),

    // Continue fetch on the other side.
    appender({ rand: { $lt: rand } }, { sort: { rand: -1 }, limit: n - result.length }),

    // Refresh fetched docs, if any.
    function (done) {
      if (result.length > 0) {
        var batch = collection.initializeUnorderedBulkOp({ w: 0 })
        for (var i = 0; i < result.length; ++i) {
          // Use $set so only .rand changes; a bare document would replace the whole record.
          batch.find({ _id: result[i]._id }).updateOne({ $set: { rand: Math.random() } })
        }
        batch.execute(done)
      } else {
        async.nextTick(done)
      }
    }

  ], function (err) {
    done(err, result)
  })
}

// Example usage
mongodb.MongoClient.connect('mongodb://localhost:27017/core-development', function (err, db) {
  if (!err) {
    findAndRefreshRand(db.collection('profiles'), 1024, { _id: true, rand: true }, function (err, result) {
      if (!err) {
        console.log(result)
      } else {
        console.error(err)
      }
      db.close()
    })
  } else {
    console.error(err)
  }
})

P.S. The question "How to find random records in mongodb" is marked as a duplicate of this one. The difference is that this question asks explicitly about a single record, whereas the other asks explicitly about getting multiple random documents.

爱死公子算了
#7 · 2018-12-31 05:14

Update for MongoDB 3.2

3.2 introduced $sample to the aggregation pipeline.

There's also a good blog post on putting it into practice.
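
A minimal sketch of what the $sample stage looks like in a pipeline. The helper name randomSamplePipeline and the collection name in the comments are placeholders, not anything from the MongoDB API; only the $match and $sample stages themselves are.

```javascript
// Builds an aggregation pipeline that returns `size` random documents,
// optionally restricted to documents matching `filter`.
function randomSamplePipeline(size, filter) {
  const pipeline = [];
  if (filter && Object.keys(filter).length > 0) {
    pipeline.push({ $match: filter });
  }
  pipeline.push({ $sample: { size: size } });
  return pipeline;
}

// In the mongo shell (3.2 or later; the collection name is a placeholder):
//   db.mycollection.aggregate(randomSamplePipeline(1))
//   db.mycollection.aggregate(randomSamplePipeline(1, { status: "A" }))
```

Putting $match first means the sample is drawn only from the filtered documents.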

For older versions (previous answer)

This was actually a feature request: http://jira.mongodb.org/browse/SERVER-533 but it was filed under "Won't fix."

The cookbook has a very good recipe to select a random document out of a collection: http://cookbook.mongodb.org/patterns/random-attribute/

To paraphrase the recipe, you assign random numbers to your documents:

db.docs.save( { key : 1, ..., random : Math.random() } )

Then select a random document:

rand = Math.random()
result = db.docs.findOne( { key : 2, random : { $gte : rand } } )
if ( result == null ) {
  result = db.docs.findOne( { key : 2, random : { $lte : rand } } )
}

Querying with both $gte and $lte is necessary to find the document with a random number nearest rand.
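
To see why the fallback query matters, here is a small in-memory sketch (no MongoDB involved). It assumes the first lookup reads documents in ascending order of random, as an index scan would, and that the fallback reads in descending order; findByRandomField is a made-up name.

```javascript
// docs: array of objects with a numeric `random` field; rand: probe in [0, 1).
function findByRandomField(docs, rand) {
  if (docs.length === 0) return null;
  const ascending = docs.slice().sort((a, b) => a.random - b.random);
  // First attempt: smallest stored value that is >= rand.
  const atOrAbove = ascending.find((doc) => doc.random >= rand);
  if (atOrAbove !== undefined) return atOrAbove;
  // Fallback: rand is larger than every stored value, so take the
  // closest one from below (the largest stored value).
  return ascending[ascending.length - 1];
}
```

Without the fallback, any rand above the largest stored value would return nothing even though the collection is not empty.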

And of course you'll want to index on the random field:

db.docs.ensureIndex( { key : 1, random :1 } )

If you're already querying against an index, simply drop it, append random: 1 to it, and add it again.
