I have a table in postgres that contains couple of millions of rows. I have checked on the internet and I found the following
SELECT myid FROM mytable ORDER BY RANDOM() LIMIT 1;
it works, but it's really slow... is there another way to make that query, or a direct way to select a random row without reading all the table? by the way 'myid' is an integer but it can be an empty field.
thanks
PostgreSQL 9.5 introduced a new approach for much faster sample selection: TABLESAMPLE
The syntax is
This is not the optimal solution if you want only one row selected, because you need to know the COUNT of the table to calculate the exact percentage.
To avoid a slow COUNT and use fast TABLESAMPLE for tables from 1 row to billions of rows, you can do:
This might not look so elegant, but probably is faster than any of the other answers.
To decide whether you want to use BERNULLI oder SYSTEM, read about the difference at http://blog.2ndquadrant.com/tablesample-in-postgresql-9-5-2/
Check this link out for some different options. http://www.depesz.com/index.php/2007/09/16/my-thoughts-on-getting-random-row/
Update: (A.Hatchkins)
The summary of the (very) long article is as follows.
The author lists four approaches:
1)
ORDER BY random() LIMIT 1;
-- slow2)
ORDER BY id where id>=random()*N LIMIT 1
-- nonuniform if there're gaps3) random column -- needs to be updated every now and then
4) custom random aggregate -- cunning method, could be slow: random() needs to be generated N times
and suggests to improve method #2 by using
5)
ORDER BY id where id=random()*N LIMIT 1
with subsequent requeries if the result is empty.You might want to experiment with
OFFSET
, as inSELECT myid FROM mytable OFFSET floor(random()*N) LIMIT 1;
The
N
is the number of rows inmytable
. You may need to first do aSELECT COUNT(*)
to figure out the value ofN
.Update (by Antony Hatchkins)
You must use
floor
here:Consider a table of 2 rows;
random()*N
generates0 <= x < 2
and for exampleSELECT myid FROM mytable OFFSET 1.7 LIMIT 1;
returns 0 rows because of implicit rounding to nearest int.I tried this with a subquery and it worked fine. Offset, at least in Postgresql v8.4.4 works fine.
I've came up with a very fast solution without
TABLESAMPLE
. Much faster thanOFFSET random()*N LIMIT 1
. It doesn't even require table count.The idea is to create an expression index with random but predictable data, for example
md5(primary key)
.Here is a test with 1M rows sample data:
Result:
This query can sometimes (with about 1/Number_of_rows probability) return 0 rows, so it needs to be checked and rerun. Also probabilities aren't exactly the same - some rows are more probable than others.
For comparison:
Results vary widely, but can be pretty bad:
The easiest and fastest way to fetch random row is to use the
tsm_system_rows
extension :Then you can select the exact number of rows you want :
This is available with PostgreSQL 9.5 and later.
See: https://www.postgresql.org/docs/current/static/tsm-system-rows.html