I need to get a random record from a table via ActiveRecord. I've followed the example from Jamis Buck from 2006.
However, I've also come across another way via a Google search (can't attribute with a link due to new user restrictions):
rand_id = rand(Model.count)
rand_record = Model.first(:conditions => ["id >= ?", rand_id])
I'm curious how others here have done it, or if anyone knows which way would be more efficient.
I tried Sam's example on my app (Rails 4.2.8) with Benchmark. (I used 1..Category.count for the random id, because if the random value is 0 it produces ActiveRecord::RecordNotFound: Couldn't find Category with 'id'=0.) Mine was:
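(The benchmark output itself isn't preserved above. A sketch of the lookup being described, assuming the Category model and ids without gaps:)

# pick a random id between 1 and the row count, then fetch that record;
# raises ActiveRecord::RecordNotFound if that id has been deleted
Category.find(rand(1..Category.count))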
In Rails 4 and 5, using PostgreSQL or SQLite, you can order by RANDOM(). Presumably the same would work for MySQL with RAND(). This is about 2.5 times faster than the approach in the accepted answer.
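A minimal sketch of that ordering (Model is a placeholder name; the answer's original snippets are not preserved above):

# PostgreSQL / SQLite
Model.order('RANDOM()').first

# MySQL
Model.order('RAND()').first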
Caveat: This is slow for large datasets with millions of records, so you might want to add a limit clause.

If you're using PostgreSQL 9.5+, you can take advantage of TABLESAMPLE to select a random record. The two default sampling methods (SYSTEM and BERNOULLI) require that you specify the number of rows to return as a percentage of the total number of rows in the table. That means you need to know the number of records in the table in order to pick an appropriate percentage, which may not be easy to find quickly. Fortunately, the tsm_system_rows module allows you to specify the number of rows to return directly. To use this within ActiveRecord, first enable the extension within a migration:
Then modify the from clause of the query, as sketched below. I don't know if the SYSTEM_ROWS sampling method will be entirely random or if it just returns the first row from a random page. Most of this information was taken from a 2ndQuadrant blog post written by Gulcin Yildirim.
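A sketch of that query, assuming a User model backed by a users table; SYSTEM_ROWS(1) asks the sampler for a single row:

# returns one sampled record without sorting the whole table
User.from('users TABLESAMPLE SYSTEM_ROWS(1)').take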
Strongly recommend this gem for random records, which is specially designed for tables with lots of data rows:

https://github.com/haopingfan/quick_random_records

All other answers perform badly with a large database, except this gem:

quick_random_records cost 4.6ms in total.
The User.order('RAND()').limit(10) approach cost 733.0ms.
The .offset approach cost 245.4ms in total.
The User.all.sample(10) approach cost 573.4ms.

Note: My table only has 120,000 users. The more records you have, the bigger the performance difference will be.
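(Assuming the gem's API is as its README describes, usage would look something like the following; the random_records method name comes from that README and User is just an example model.)

# hypothetical usage based on the gem's documented API:
# returns 10 randomly selected User records
User.random_records(10)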
After seeing so many answers I decided to benchmark them all on my PostgreSQL (9.6.3) database. I used a smaller 100,000-row table and got rid of Model.order("RANDOM()").first since it was already two orders of magnitude slower.
Using a table with 2,500,000 entries and 10 columns, the hands-down winner was the pluck method (sketched after this paragraph), being almost 8 times faster than the runner-up (offset). I only ran this on a local server, so that number might be inflated, but it's big enough that the pluck method is what I'll end up using. It's also worth noting that this might cause issues if you pluck more than one result at a time, since each of those will be unique, i.e. less random.
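For context, a sketch of what the pluck and offset approaches generally look like (the User model is illustrative; this is not necessarily the exact code benchmarked here):

# 'pluck' approach: load every id, sample one in Ruby, then fetch that record
User.find(User.pluck(:id).sample)

# 'offset' approach: skip a random number of rows and take the next one
User.offset(rand(User.count)).first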
Pluck wins running 100 times on my 25,000,000-row table. Edit: actually, that time includes the pluck call inside the loop; if I take it out, it runs about as fast as simple iteration over the ids. However, it does take up a fair amount of RAM.

Here is the data from running 2,000 times on my 100,000-row table to rule out random variance.
.order('RANDOM()').limit(limit) looks neat but is slow for large tables, because the database needs to fetch and sort all rows even if limit is 1 (this happens inside the database, not in Rails). I'm not sure about MySQL, but it happens in Postgres. More explanation here and here.

One solution for large tables is .from("products TABLESAMPLE SYSTEM(0.5)"), where 0.5 means 0.5%. However, I find this solution is still slow if you have WHERE conditions that filter out a lot of rows. I guess that's because TABLESAMPLE SYSTEM(0.5) fetches all rows before the WHERE conditions apply.

Another solution for large tables (but not very random) is:
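(The code for this solution isn't preserved above; the following is a plausible sketch of the idea, with Product, sample_size, and limit as illustrative names: take the first sample_size rows and sample from them in Ruby.)

sample_size = 100
limit = 1
# loads sample_size records into memory, then picks `limit` of them at random
Product.limit(sample_size).sample(limit)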
where sample_size can be 100 (but not too large, otherwise it's slow and consumes a lot of memory), and limit can be 1. Note that although this is fast, it's not really random: it's only random within the first sample_size records.

PS: The benchmark results in the answers above are not reliable (at least in Postgres), because a DB query run a second time can be significantly faster than its first run, thanks to the DB cache. And unfortunately there is no easy way to disable the cache in Postgres to make these benchmarks reliable.