Randomizing a large dataset

Posted 2019-02-19 13:29

I am trying to find a way to get a random selection from a large dataset.

We expect the set to grow to ~500K records, so it is important to find a way that keeps performing well while the set grows.

I tried a technique from http://forums.mysql.com/read.php?24,163940,262235#msg-262235, but it's not exactly random and it doesn't play well with a LIMIT clause: you don't always get the number of records that you want.

So I thought that, since the PK is auto_increment, I could just generate a list of random IDs and use an IN clause to select the rows I want. The problem with that approach is that sometimes I need a random set of records with a specific status, a status that is found in at most 5% of the total set. To make that work I would first need to find out which IDs have that specific status, so that's not going to work either.

I am using MySQL 5.1.46 with the MyISAM storage engine.
It might be important to know that the query to select the random rows is going to be run very often and the table it is selecting from is appended to frequently.

Any help would be greatly appreciated!

Tags: mysql random
3 answers
迷人小祖宗
#2 · 2019-02-19 13:42

Check out this article by Jan Kneschke. It does a great job of explaining the pros and cons of different approaches to this problem.

我命由我不由天
#3 · 2019-02-19 13:54

You could solve this with some denormalization:

  • Build a secondary table that contains the same pkeys and statuses as your data table
  • Add and populate a status group column which will be a kind of sub-pkey that you auto number yourself (1-based autoincrement relative to a single status)
Pkey    Status    StatusPkey
1       A         1
2       A         2
3       B         1
4       B         2
5       C         1
...     C         ...
n       C         m (where m = # of C statuses)

When you don't need to filter, you can generate random numbers against the pkey as you mentioned above. When you do need to filter, generate random numbers against the StatusPkeys of the particular status you're interested in.

There are several ways to build this table. You could have a procedure that you run on an interval, or you could maintain it live. The latter would be a performance hit, though, since calculating the StatusPkey could get expensive.
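
A batch rebuild of the kind described above could be sketched roughly as follows. This is only an illustration, not a tested MySQL procedure: it uses Python's built-in sqlite3 module with an in-memory database as a stand-in for MySQL, and the names data, status_lookup, status_pkey, rebuild_lookup and random_pkeys are made up for the example.

```python
import random
import sqlite3

# SQLite in-memory database used as a stand-in for MySQL; all table and
# column names here are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (pkey INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO data (pkey, status) VALUES (?, ?)",
    [(1, "A"), (2, "A"), (3, "B"), (4, "B"), (5, "C"), (6, "C"), (7, "C")],
)


def rebuild_lookup(conn):
    """Batch rebuild: number the rows of each status 1..m in pkey order."""
    conn.execute("DROP TABLE IF EXISTS status_lookup")
    conn.execute(
        "CREATE TABLE status_lookup ("
        "pkey INTEGER PRIMARY KEY, status TEXT, status_pkey INTEGER)"
    )
    counters = {}  # running per-status counter
    rows = conn.execute("SELECT pkey, status FROM data ORDER BY pkey").fetchall()
    for pkey, status in rows:
        counters[status] = counters.get(status, 0) + 1
        conn.execute(
            "INSERT INTO status_lookup VALUES (?, ?, ?)",
            (pkey, status, counters[status]),
        )
    # Index so that (status, status_pkey) lookups do not scan the table.
    conn.execute(
        "CREATE UNIQUE INDEX idx_status_pkey ON status_lookup (status, status_pkey)"
    )


def random_pkeys(conn, status, n):
    """Return up to n distinct random pkeys that have the given status."""
    (m,) = conn.execute(
        "SELECT COUNT(*) FROM status_lookup WHERE status = ?", (status,)
    ).fetchone()
    # Draw distinct StatusPkeys from 1..m, then map them back to pkeys.
    picks = random.sample(range(1, m + 1), min(n, m))
    if not picks:
        return []
    placeholders = ",".join("?" * len(picks))
    rows = conn.execute(
        f"SELECT pkey FROM status_lookup "
        f"WHERE status = ? AND status_pkey IN ({placeholders})",
        [status, *picks],
    ).fetchall()
    return [r[0] for r in rows]


rebuild_lookup(conn)
sample = random_pkeys(conn, "C", 2)  # two distinct pkeys out of 5, 6, 7
```

Because the StatusPkeys of one status have no gaps, every random number in 1..m hits a row, so you always get the number of records you asked for (as long as at least that many exist). Rows appended after the last rebuild are simply not eligible until the next one.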

欢心
#4 · 2019-02-19 14:09

You can do this efficiently, but you have to do it in two queries.

First get a random offset scaled by the number of rows that match your 5% conditions:

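SELECT FLOOR(RAND() * (SELECT COUNT(*) FROM MyTable WHERE ...conditions...))

This returns an integer between 0 and one less than the number of matching rows. (Use FLOOR rather than ROUND: ROUND can return the row count itself, which is one past the last valid offset and yields an empty result.) Next, use the integer as an offset in a LIMIT expression: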

SELECT * FROM MyTable WHERE ...conditions... LIMIT 1 OFFSET ?

Not every problem must be solved in a single SQL query.
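
From application code, the two queries could be driven along these lines. Treat it as a sketch under assumptions: it runs against an in-memory SQLite database through Python's sqlite3 module instead of MySQL, the status column and the random_row helper are invented for the example, and the random offset is computed in Python rather than with RAND(). The ORDER BY is also an addition, there only to make the row numbering stable.

```python
import random
import sqlite3

# SQLite stand-in for MySQL; MyTable's columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyTable (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO MyTable (id, status) VALUES (?, ?)",
    # 5% of the 200 rows get the rare status.
    [(i, "rare" if i % 20 == 0 else "common") for i in range(1, 201)],
)


def random_row(conn, status):
    """Pick one random row with the given status, or None if none match."""
    # Query 1: count the rows that match the filter.
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM MyTable WHERE status = ?", (status,)
    ).fetchone()
    if count == 0:
        return None
    # Offsets are 0-based, so the valid range is 0 .. count - 1.
    offset = random.randrange(count)
    # Query 2: skip `offset` matching rows and return the next one.
    return conn.execute(
        "SELECT id, status FROM MyTable WHERE status = ? "
        "ORDER BY id LIMIT 1 OFFSET ?",
        (status, offset),
    ).fetchone()


row = random_row(conn, "rare")  # an (id, status) tuple with status "rare"
```

Since the table is only appended to, a count that is slightly stale by the time the second query runs still produces a valid offset.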
