Selecting random rows from a big table in H2 datab

2019-05-28 12:39发布

问题:

I have a big table in my database (potentially millions of records) and I need to select #X random rows (let's say #X between 10 and 50) , but I need this query to be as optimal as possible.

The table looks like this:

CREATE TABLE sample (
    id bigint auto_increment PRIMARY KEY,
    user_id bigint NOT NULL,
    screen_name VARCHAR NOT NULL,
    ...
);

I've searched around and I found answers like this:

SELECT * FROM sample ORDER BY RAND() limit X.

But It looks to me that this will fetch the full table then sort it, isn't it?

I think it would be best to generate 10 or 50 random integers and do a select * from sample where rowid in (<random integer list>). But afaik, rowid concept is missing in H2, so I may opt for using the ID column in my table.

It would be awesome if I can do this task with a single SQL query.

Any better advice?

回答1:

The following script selects every nth row quite efficiently. It assumes there are no gaps in the ids. If gaps are possible, then you might want to increase the range(1, 100) to range(1, 200) or so. To get random rows, the formula at the very end would need to be changed a bit:

drop table test;

create table test(
  id bigint auto_increment primary key, 
  name varchar(255));

insert into test 
select x, 'Hello ' || x from system_range(50, 1200);

select * from test t, system_range(1, 100) range
where t.id = x * (select max(id)-min(id) from test) / 100 + 
(select min(id) from test);


回答2:

You should use column id instead of rowid. Column id exists in your table and is auto_increment.



回答3:

You can Rank your table and select Random 50 ranks out of it, avoid sorting or grouping in any way to keep it optimized.



回答4:

What I do for this is create a temp table. Generate random numbers from 1 to greatest identity value in the table. Then select from the table where their identity value is in the temp table.

"Single query way of doing this"

Create temp table (I don't know the h2 syntax for this but it supports temp tables with a field name of DesiredIdentity)

Select max identity value from the table.

Loop through with the rand command to insert random numbers into a temp table from 1 to the number of random rows you want. Set the range for random from 1 to max row count. Insure the same random number is not selected.

Then select from the table where the identity value is in the identity temp table.