quick random row selection in Postgres

I have a table in postgres that contains couple of millions of rows. I have checked on the internet and I found the following

SELECT myid FROM mytable ORDER BY RANDOM() LIMIT 1;

it works, but it's really slow... is there another way to make that query, or a direct way to select a random row without reading all the table? by the way 'myid' is an integer but it can be an empty field.

thanks

标签： postgresql random-access

7条回答

再贱就再见

2楼-- · 2019-01-03 09:07

PostgreSQL 9.5 introduced a new approach for much faster sample selection: TABLESAMPLE

The syntax is

SELECT * FROM my_table TABLESAMPLE BERNOULLI(percentage);
SELECT * FROM my_table TABLESAMPLE SYSTEM(percentage);

This is not the optimal solution if you want only one row selected, because you need to know the COUNT of the table to calculate the exact percentage.

To avoid a slow COUNT and use fast TABLESAMPLE for tables from 1 row to billions of rows, you can do:

 SELECT * FROM my_table TABLESAMPLE SYSTEM(0.000001) LIMIT 1;
 if you got no result:
 SELECT * FROM my_table TABLESAMPLE SYSTEM(0.00001) LIMIT 1;
 if you got no result:
 SELECT * FROM my_table TABLESAMPLE SYSTEM(0.0001) LIMIT 1;
 if you got no result:
 SELECT * FROM my_table TABLESAMPLE SYSTEM(0.001) LIMIT 1;
 ...

This might not look so elegant, but probably is faster than any of the other answers.

To decide whether you want to use BERNULLI oder SYSTEM, read about the difference at http://blog.2ndquadrant.com/tablesample-in-postgresql-9-5-2/

0人赞添加讨论(0) 举报

爱情/是我丢掉的垃圾

3楼-- · 2019-01-03 09:12

Check this link out for some different options. http://www.depesz.com/index.php/2007/09/16/my-thoughts-on-getting-random-row/

Update: (A.Hatchkins)

The summary of the (very) long article is as follows.

The author lists four approaches:

1) ORDER BY random() LIMIT 1; -- slow

2) ORDER BY id where id>=random()*N LIMIT 1 -- nonuniform if there're gaps

3) random column -- needs to be updated every now and then

4) custom random aggregate -- cunning method, could be slow: random() needs to be generated N times

and suggests to improve method #2 by using

5) ORDER BY id where id=random()*N LIMIT 1 with subsequent requeries if the result is empty.

0人赞添加讨论(0) 举报

叼着烟拽天下

4楼-- · 2019-01-03 09:16

You might want to experiment with OFFSET, as in

SELECT myid FROM mytable OFFSET floor(random()*N) LIMIT 1;

The N is the number of rows in mytable. You may need to first do a SELECT COUNT(*) to figure out the value of N.

Update (by Antony Hatchkins)

You must use floor here:

SELECT myid FROM mytable OFFSET floor(random()*N) LIMIT 1;

Consider a table of 2 rows; random()*N generates 0 <= x < 2 and for example SELECT myid FROM mytable OFFSET 1.7 LIMIT 1; returns 0 rows because of implicit rounding to nearest int.

0人赞添加讨论(0) 举报

爷的心禁止访问

5楼-- · 2019-01-03 09:16

I tried this with a subquery and it worked fine. Offset, at least in Postgresql v8.4.4 works fine.

select * from mytable offset random() * (select count(*) from mytable) limit 1 ;

0人赞添加讨论(0) 举报

不美不萌又怎样

6楼-- · 2019-01-03 09:22

I've came up with a very fast solution without TABLESAMPLE. Much faster than OFFSET random()*N LIMIT 1. It doesn't even require table count.

The idea is to create an expression index with random but predictable data, for example md5(primary key).

Here is a test with 1M rows sample data:

create table randtest (id serial primary key, data int not null);

insert into randtest (data) select (random()*1000000)::int from generate_series(1,1000000);

create index randtest_md5_id_idx on randtest (md5(id::text));

explain analyze
select * from randtest where md5(id::text)>md5(random()::text)
order by md5(id::text) limit 1;

Result:

 Limit  (cost=0.42..0.68 rows=1 width=8) (actual time=6.219..6.220 rows=1 loops=1)
   ->  Index Scan using randtest_md5_id_idx on randtest  (cost=0.42..84040.42 rows=333333 width=8) (actual time=6.217..6.217 rows=1 loops=1)
         Filter: (md5((id)::text) > md5((random())::text))
         Rows Removed by Filter: 1831
 Total runtime: 6.245 ms

This query can sometimes (with about 1/Number_of_rows probability) return 0 rows, so it needs to be checked and rerun. Also probabilities aren't exactly the same - some rows are more probable than others.

For comparison:

explain analyze SELECT id FROM randtest OFFSET random()*1000000 LIMIT 1;

Results vary widely, but can be pretty bad:

 Limit  (cost=1442.50..1442.51 rows=1 width=4) (actual time=179.183..179.184 rows=1 loops=1)
   ->  Seq Scan on randtest  (cost=0.00..14425.00 rows=1000000 width=4) (actual time=0.016..134.835 rows=915702 loops=1)
 Total runtime: 179.211 ms
(3 rows)

0人赞添加讨论(0) 举报

我欲成王，谁敢阻挡

7楼-- · 2019-01-03 09:23

The easiest and fastest way to fetch random row is to use the tsm_system_rows extension :

CREATE EXTENSION IF NOT EXISTS tsm_system_rows;

Then you can select the exact number of rows you want :

SELECT myid  FROM mytable TABLESAMPLE SYSTEM_ROWS(1);

This is available with PostgreSQL 9.5 and later.

See: https://www.postgresql.org/docs/current/static/tsm-system-rows.html

0人赞添加讨论(0) 举报

1 2 下一页

quick random row selection in Postgres

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间