I want a random selection of rows in PostgreSQL. I tried this:
```sql
select * from table where random() < 0.01;
```
But others recommend this:
```sql
select * from table order by random() limit 1000;
```
I have a very large table with 500 million rows, and I want it to be fast.
Which approach is better? What are the differences? What is the best way to select random rows?
Here is a solution that works for me. I guess it's very simple to understand and execute.
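Presumably the query was the simple ORDER BY random() form, along these lines (table and column names illustrative):

```sql
-- a minimal sketch of the simple approach; names are illustrative
SELECT *
FROM   big_table
ORDER  BY random()   -- assign every row a random sort key
LIMIT  1000;         -- keep the first 1000
```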
If you know how many rows you want, check out tsm_system_rows.
First, install the extension:
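```sql
-- tsm_system_rows ships with the standard contrib modules
CREATE EXTENSION tsm_system_rows;
```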
Then your query (table name illustrative):
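```sql
-- returns exactly 1000 rows (if the table has that many), sampled by block
SELECT *
FROM   big_table TABLESAMPLE SYSTEM_ROWS(1000);
```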
Given your specifications (plus additional info in the comments), the query below does not need a sequential scan of the big table, only an index scan.
First, get estimates for the main query:
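A sketch of such an estimate query, assuming a table named big with an integer primary key id (these names are assumed throughout):

```sql
SELECT count(*)          AS ct      -- optional, see note below
     , min(id)           AS min_id
     , max(id)           AS max_id
     , max(id) - min(id) AS id_span
FROM   big;
```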
The only possibly expensive part is the count(*) (for huge tables). Given the above specifications, you don't need it. An estimate will do just fine, available at almost no cost (detailed explanation here):
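A common low-cost estimate reads the planner's statistics from the system catalog, for instance:

```sql
-- rough row-count estimate maintained by VACUUM / ANALYZE
SELECT reltuples::bigint AS ct_estimate
FROM   pg_class
WHERE  oid = 'big'::regclass;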
As long as ct isn't much smaller than id_span, the query will outperform other approaches.

- Generate random numbers in the id space. You have "few gaps", so add 10 % (enough to easily cover the blanks) to the number of rows to retrieve.
- Each id can be picked multiple times by chance (though very unlikely with a big id space), so group the generated numbers (or use DISTINCT).
- Join the ids to the big table. This should be very fast with the index in place.
- Finally, trim surplus ids that have not been eaten by dupes and gaps. Every row has a completely equal chance to be picked. (A sketch of the full query follows this list.)
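Putting the steps together - a sketch, with placeholder numbers standing in for the estimates gathered above:

```sql
WITH params AS (
   SELECT 1       AS min_id   -- minimum id (<= current min id)
        , 5100000 AS id_span  -- rounded up: max_id - min_id + buffer
   )
SELECT *
FROM  (
   SELECT p.min_id + trunc(random() * p.id_span)::int AS id
   FROM   params p
        , generate_series(1, 1100) g  -- 1000 rows + 10 % buffer
   GROUP  BY 1                        -- trim duplicate ids
   ) r
JOIN   big USING (id)                 -- eliminate ids falling into gaps
LIMIT  1000;                          -- trim surplus
```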
Short version

You can simplify this query. The CTE in the query above is just for educational purposes:
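For example, with the constants inlined:

```sql
SELECT *
FROM  (
   SELECT DISTINCT 1 + trunc(random() * 5100000)::int AS id
   FROM   generate_series(1, 1100) g
   ) r
JOIN   big USING (id)
LIMIT  1000;
```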
Refine with rCTE
Especially if you are not so sure about gaps and estimates.
We can work with a smaller surplus in the base query. If there are too many gaps, so we don't find enough rows in the first iteration, the rCTE continues to iterate with the recursive term. We still need relatively few gaps in the ID space, or the recursion may run dry before the limit is reached - or we have to start with a large enough buffer, which defeats the purpose of optimizing performance.
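A sketch of such an rCTE, under the same assumptions (placeholder numbers again):

```sql
WITH RECURSIVE random_pick AS (
   SELECT *
   FROM  (
      SELECT 1 + trunc(random() * 5100000)::int AS id
      FROM   generate_series(1, 1030)  -- 1000 + small surplus
      LIMIT  1030                      -- hint for the query planner
      ) r
   JOIN   big USING (id)               -- eliminate misses

   UNION                               -- eliminate duplicates
   SELECT b.*
   FROM  (
      SELECT 1 + trunc(random() * 5100000)::int AS id
      FROM   random_pick               -- just to make it recursive
      LIMIT  999                       -- < 1000, hint for the query planner
      ) r
   JOIN   big b USING (id)             -- eliminate misses
   )
TABLE  random_pick
LIMIT  1000;  -- actual limit
```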
- Duplicates are eliminated by the UNION in the rCTE.
- The outer LIMIT makes the CTE stop as soon as we have enough rows.

This query is carefully drafted to use the available index, generate actually random rows and not stop until we fulfill the limit (unless the recursion runs dry). There are a number of pitfalls here if you are going to rewrite it.
Wrap into function
For repeated use with varying parameters:
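A sketch, hard-wired to the assumed table big (and assuming id is its first column); the function name, signature and defaults are illustrative:

```sql
CREATE OR REPLACE FUNCTION f_random_sample(_limit int = 1000, _gaps real = 1.03)
  RETURNS SETOF big
  LANGUAGE plpgsql VOLATILE ROWS 1000 AS
$func$
DECLARE
   _surplus  int := _limit * _gaps;  -- oversample to cover gaps
   _estimate int := (                -- current row-count estimate
      SELECT reltuples::bigint
      FROM   pg_class
      WHERE  oid = 'big'::regclass);
BEGIN
   RETURN QUERY
   WITH RECURSIVE random_pick AS (
      SELECT *
      FROM  (
         SELECT 1 + trunc(random() * _estimate)::int
         FROM   generate_series(1, _surplus) g
         LIMIT  _surplus              -- hint for the query planner
         ) r (id)
      JOIN   big USING (id)           -- eliminate misses

      UNION                           -- eliminate duplicates
      SELECT *
      FROM  (
         SELECT 1 + trunc(random() * _estimate)::int
         FROM   random_pick           -- just to make it recursive
         LIMIT  _limit                -- hint for the query planner
         ) r (id)
      JOIN   big USING (id)           -- eliminate misses
      )
   TABLE  random_pick
   LIMIT  _limit;
END
$func$;
```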
Call:
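Using the sketch above:

```sql
SELECT * FROM f_random_sample();           -- defaults: 1000 rows, 3 % surplus
SELECT * FROM f_random_sample(500, 1.05);  -- 500 rows, 5 % surplus
```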
You could even make this generic to work for any table: take the name of the PK column and the table as polymorphic type and use EXECUTE ... But that's beyond the scope of this question.

Possible alternative
If your requirements allow identical sets for repeated calls (and we are talking about repeated calls), I would consider a materialized view. Execute the above query once and write the result to a table. Users get a quasi-random selection at lightning speed. Refresh your random pick at intervals or events of your choosing.
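For instance (view name illustrative, building on the function sketched above):

```sql
CREATE MATERIALIZED VIEW mv_random_pick AS
SELECT * FROM f_random_sample();

-- re-roll the selection at intervals or events of your choosing
REFRESH MATERIALIZED VIEW mv_random_pick;
```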
Postgres 9.5 introduces TABLESAMPLE SYSTEM (n).

It's very fast, but the result is not exactly random: per the manual, SYSTEM does block-level sampling and returns all rows of each selected block, so physical clustering can skew the sample.
And the number of rows returned can vary wildly. For our example, to get roughly 1000 rows, try:
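For the numbers in the question, the percentage works out to 100 * 1000 / 500000000 = 0.0002:

```sql
-- n is a percentage of blocks, so the row count is only approximate
SELECT * FROM big TABLESAMPLE SYSTEM (0.0002);
```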
Where n is a percentage. The manual notes that the SYSTEM method is significantly faster than BERNOULLI when small sampling percentages are specified, but that it may return a less random sample of the table as a result of clustering effects.
Or install the additional module tsm_system_rows to get the number of requested rows exactly (if there are enough) and allow for the more convenient syntax:
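For example (same assumed table):

```sql
CREATE EXTENSION IF NOT EXISTS tsm_system_rows;

SELECT * FROM big TABLESAMPLE SYSTEM_ROWS(1000);
```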
See Evan's answer for details.
But that's still not exactly random.
The one with the ORDER BY is going to be the slower one.
```sql
select * from table where random() < 0.01;
```

goes record by record, and decides whether to randomly filter it or not. This is going to be O(N) because it only needs to check each record once.

```sql
select * from table order by random() limit 1000;
```

is going to sort the entire table, then pick the first 1000. Aside from any voodoo magic behind the scenes, the ORDER BY is O(N * log N).

The downside to the random() < 0.01 approach is that you'll get a variable number of output records.

Note, there is a better way to shuffle a set of data than sorting by random: the Fisher-Yates shuffle, which runs in O(N). Implementing the shuffle in SQL sounds like quite the challenge, though.

If you want just one row, you can use a calculated offset derived from count.
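A sketch of that approach (table name assumed). Note that a big random offset still reads and discards all the skipped rows, so this stays slow on large tables:

```sql
-- pick a single row at a uniformly random offset
SELECT *
FROM   big
OFFSET floor(random() * (SELECT count(*) FROM big))::bigint
LIMIT  1;
```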