How do I take an efficient simple random sample in SQL? The database in question is running MySQL; my table is at least 200,000 rows, and I want a simple random sample of about 10,000.
The "obvious" answer is to:
SELECT * FROM table ORDER BY RAND() LIMIT 10000
For large tables, that's too slow: it calls RAND() for every row (which already puts it at O(n)), and sorts them, making it O(n lg n) at best. Is there a way to do this faster than O(n)?
Note: As Andrew Mao points out in the comments, if you're using this approach on SQL Server, you should use the T-SQL function NEWID(), because RAND() may return the same value for all rows.
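For example, a rough T-SQL equivalent of the query above (still O(n lg n), just with a per-row random value instead of a single RAND()):

SELECT TOP 10000 * FROM table ORDER BY NEWID()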
EDIT: 5 YEARS LATER
I ran into this problem again with a bigger table, and ended up using a version of @ignorant's solution, with two tweaks:
- First sample the rows down to 2-5x my desired sample size, so the final ORDER BY RAND() is cheap
- Save the result of RAND() to an indexed column on every insert/update. (If your data set isn't very update-heavy, you may need to find another way to keep this column fresh.)
To take a 1000-item sample of a table, I count the rows and sample the result down to, on average, 10,000 rows with the frozen_rand column:
SELECT COUNT(*) FROM table; -- Use this to determine rand_low and rand_high
SELECT *
FROM table
WHERE frozen_rand BETWEEN %(rand_low)s AND %(rand_high)s
ORDER BY RAND() LIMIT 1000
(My actual implementation involves more work to make sure I don't undersample, and to manually wrap rand_high around, but the basic idea is "randomly cut your N down to a few thousand.")
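For illustration, one way to pick that window (a sketch, assuming frozen_rand is uniform on [0, 1) and ignoring the wrap-around handling mentioned above):

import random

def rand_window(total_rows, presample=10000):
    # Fraction of the table the indexed range scan should return on average.
    width = min(1.0, presample / total_rows)
    # Place the window at a random offset; the real implementation also
    # wraps rand_high around past 1.0 and guards against undersampling.
    rand_low = random.uniform(0, 1 - width)
    return rand_low, rand_low + width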
While this makes some sacrifices, it allows me to sample the database down using an index scan, until it's small enough to ORDER BY RAND() again.
I think the fastest solution is
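-- keeps each row with probability 0.3; tune the fraction to your sample size
select * from table where rand() <= .3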
Here is why I think this should do the job: it generates a random number between 0 and 1 for every row, and selects a row whenever that number is at or below 0.3, so each row is kept independently with 30% probability. This assumes that rand() generates numbers from a uniform distribution. It is the quickest way to do this.
I saw that someone had recommended that solution and they got shot down without proof. Here is what I would say to that: MySQL is very capable of generating random numbers for each row. Try this:
select rand() from INFORMATION_SCHEMA.TABLES limit 10;
Since the database in question is MySQL, this is the right solution.
If you need exactly m rows, realistically you'll generate your subset of IDs outside of SQL. Most methods require at some point to select the "nth" entry, and SQL tables are really not arrays at all. The assumption that the keys are consecutive, so that you can just join against random ints between 1 and the count, is also difficult to satisfy: MySQL, for example, doesn't support it natively, and the lock conditions are... tricky.

Here's an O(max(n, m lg n))-time, O(n)-space solution assuming just plain BTREE keys:

1. Read all the values of the key field of the table, in any order, into an array: O(n)
2. Perform a Fisher-Yates shuffle, stopping after m swaps, and extract the subarray [0:m-1]: ϴ(m)
3. "Join" the subarray with the original dataset (e.g. SELECT ... WHERE id IN (<subarray>)): O(m lg n)

Any method that generates the random subset outside of SQL must have at least this complexity. The join can't be any faster than O(m lg n) with BTREE (so O(m) claims are fantasy for most engines), and the shuffle, at ϴ(m), is dominated by both n and m lg n, so it doesn't affect the asymptotic behavior.

In Pythonic pseudocode (a sketch; query() stands in for your database access):
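import random

def random_sample(query, m):
    # O(n): read every key value into an array.
    ids = [row[0] for row in query("SELECT id FROM t")]
    n = len(ids)
    # Theta(m): partial Fisher-Yates shuffle; after m swaps the first
    # m entries are a uniform random subset of the keys.
    for i in range(m):
        j = random.randrange(i, n)
        ids[i], ids[j] = ids[j], ids[i]
    # O(m lg n): join the subarray back through the BTREE index.
    subset = ", ".join(str(k) for k in ids[:m])
    return query("SELECT * FROM t WHERE id IN (%s)" % subset)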
Apparently in some versions of SQL there's a TABLESAMPLE command, but it's not in all SQL implementations (notably, Redshift).

http://technet.microsoft.com/en-us/library/ms189108(v=sql.105).aspx
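On SQL Server, for example, it looks like this. Note that TABLESAMPLE samples whole data pages, so it returns approximately, not exactly, the requested number of rows:

SELECT * FROM table TABLESAMPLE (10000 ROWS)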
There's a very interesting discussion of this type of issue here: http://www.titov.net/2005/09/21/do-not-use-order-by-rand-or-how-to-get-random-rows-from-table/
I think with absolutely no assumptions about the table that your O(n lg n) solution is the best. Though actually with a good optimizer or a slightly different technique the query you list may be a bit better, O(m*n) where m is the number of random rows desired, as it wouldn't necessarily have to sort the whole large array; it could just search for the smallest m times. But for the sort of numbers you posted, m is bigger than lg n anyway.
Three assumptions we might try out:
there is a unique, indexed, primary key in the table
the number of random rows you want to select (m) is much smaller than the number of rows in the table (n)
the unique primary key is an integer that ranges from 1 to n with no gaps
With only assumptions 1 and 2 I think this can be done in O(n), though you'll need to write a whole index to the table to match assumption 3, so it's not necessarily a fast O(n). If we can ADDITIONALLY assume something else nice about the table, we can do the task in O(m log m). Assumption 3 would be an easy, nice additional property to work with. With a nice random number generator that guaranteed no duplicates when generating m numbers in a row, an O(m) solution would be possible.
Given the three assumptions, the basic idea is to generate m unique random numbers between 1 and n, and then select the rows with those keys from the table. I don't have MySQL or anything in front of me right now, so in slightly pseudocode-flavored Python (run_query standing in for your database access, and the primary key assumed to be named id) this would look something like:
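import random

def sample_m_rows(run_query, n, m):
    # Assumption 3: the keys run from 1 to n with no gaps, so we can draw
    # straight from that range; random.sample guarantees m distinct values.
    keys = random.sample(range(1, n + 1), m)
    placeholders = ", ".join(["%s"] * m)
    return run_query("SELECT * FROM table WHERE id IN (%s)" % placeholders, keys)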
If you were really concerned about efficiency, you might consider doing the random key generation in some sort of procedural language and inserting the results in the database, as almost anything other than SQL would probably be better at the sort of looping and random number generation required.
I want to point out that all of these solutions appear to sample without replacement. Selecting the top K rows from a random sort or joining to a table that contains unique keys in random order will yield a random sample generated without replacement.
If you want your sample to be independent, you'll need to sample with replacement. See Question 25451034 for one example of how to do this using a JOIN in a manner similar to user12861's solution. The solution is written for T-SQL, but the concept works in any SQL db.