I want a random selection of rows in PostgreSQL. I tried this:
select * from table where random() < 0.01;
But some others recommend this:
select * from table order by random() limit 1000;
I have a very large table with 500 million rows, and I want the query to be fast.
Which approach is better? What are the differences? What is the best way to select random rows?
Add a column called r with type serial. Index r.

Assume we have 200,000 rows; we are going to generate a random number n, where 0 < n <= 200,000.

Select rows with r > n, sort them ASC and select the smallest one. Code:
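A sketch of that query, assuming the table is named mytable (the inner subquery pulls the row-count estimate from pg_class, per the linked answer below):

select * from mytable
where r > (
    select (
        select reltuples::bigint as estimate   -- fast row-count estimate
        from   pg_class
        where  oid = 'public.mytable'::regclass
    ) * random()                               -- random n in (0, row count]
)
order by r asc
limit 1;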
The code is self-explanatory. The subquery in the middle is used to quickly estimate the table row count, taken from https://stackoverflow.com/a/7945274/1271094.
At the application level you need to execute the statement again if n > the number of rows, or if you need to select multiple rows.

postgresql order by random(), select rows in random order:
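A sketch with placeholder names your_table and your_columns:

select your_columns from your_table order by random();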
postgresql order by random() with a distinct:
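Likewise, wrapping a distinct subquery (Postgres requires the alias t here):

select * from (select distinct your_columns from your_table) as t order by random();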
postgresql order by random limit one row:
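And limited to a single row:

select your_columns from your_table order by random() limit 1;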
You can examine and compare the execution plan of both by using EXPLAIN.
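For example, against a placeholder table mytable (note that EXPLAIN ANALYZE actually executes the query):

explain analyze select * from mytable where random() < 0.01;
explain analyze select * from mytable order by random() limit 1000;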
A quick test on a large table¹ shows that the ORDER BY variant first sorts the complete table and then picks the first 1000 items. Sorting a large table not only reads that table but also involves reading and writing temporary files. The where random() < 0.1 variant only scans the complete table once.

For large tables this might not be what you want, as even one complete table scan might take too long.
A third proposal would be to combine the random() filter with a limit:
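A sketch, again assuming a table named mytable:

select * from mytable where random() < 0.01 limit 1000;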
This one stops the table scan as soon as 1000 rows have been found and therefore returns sooner. Of course this bogs down the randomness a bit, but perhaps this is good enough in your case.
Edit: Besides these considerations, you might check out the questions already asked on this topic. Using the search query [postgresql] random returns quite a few hits. And a linked article by depesz outlines several more approaches.
¹ "large" as in "the complete table will not fit into memory".
Starting with PostgreSQL 9.5, there's a new syntax dedicated to getting random elements from a table:
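A sketch using the SYSTEM sampling method:

select * from mytable tablesample system (5);

SYSTEM samples whole disk pages, so it is fast but slightly less uniform than BERNOULLI, which samples individual rows.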
This example will give you 5% of elements from mytable.

See more explanation in the documentation: http://www.postgresql.org/docs/current/static/sql-select.html
A variation of the "Possible alternative" materialized-view approach outlined by Erwin Brandstetter is possible.
Say, for example, that you don't want duplicates in the randomized values that are returned. So you will need to set a boolean value on the primary table containing your (non-randomized) set of values.
Assuming this is the input table:
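A sketch of such a table, assuming a text id plus the boolean flag:

create table id_values (
    id   text primary key,
    used boolean not null default false  -- set once a value has been consumed
);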
Populate the id_values table as needed. Then, as described by Erwin, create a materialized view that randomizes the id_values table once:
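A sketch under the same assumptions:

create materialized view id_values_randomized as
select id from id_values order by random();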
Note that the materialized view does not contain the used column, because that would quickly become out of date. Nor does the view need to contain other columns that may be in the id_values table.

In order to obtain (and "consume") random values, use an UPDATE-RETURNING on id_values, selecting the id from id_values_randomized with a join, and applying the desired criteria to obtain only relevant possibilities. For example:
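A sketch under the same assumptions; the eligibility criteria and the LIMIT are placeholders:

update id_values
set    used = true
where  id_values.id in (
    select i.id
    from   id_values_randomized r
    join   id_values i on i.id = r.id
    where  not i.used  -- the "eligible rows" criteria
    limit  2           -- how many random values to consume
)
returning id;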
Change LIMIT as necessary -- if you only need one random value at a time, change LIMIT to 1.

With the proper indexes on id_values, I believe the UPDATE-RETURNING should execute very quickly with little load. It returns randomized values with one database round-trip. The criteria for "eligible" rows can be as complex as required. New rows can be added to the id_values table at any time, and they will become accessible to the application as soon as the materialized view is refreshed (which can likely be run at an off-peak time). Creation and refresh of the materialized view will be slow, but they only need to be executed when new ids are added to the id_values table.
I know I'm a little late to the party, but I just found this awesome tool called pg_sample, which extracts a sample dataset from a larger PostgreSQL database while maintaining referential integrity.

I tried this on a 350M-row database and it was really fast; I don't know about the randomness.