Select n random rows from SQL Server table

2019-01-01 01:44发布

I've got a SQL Server table with about 50,000 rows in it. I want to select about 5,000 of those rows at random. I've thought of a complicated way, creating a temp table with a "random number" column, copying my table into that, looping through the temp table and updating each row with RAND(), and then selecting from that table where the random number column < 0.1. I'm looking for a simpler way to do it, in a single statement if possible.

This article suggest using the NEWID() function. That looks promising, but I can't see how I could reliably select a certain percentage of rows.

Anybody ever do this before? Any ideas?

15条回答
ら面具成の殇う
2楼-- · 2019-01-01 02:15

This is a combination of the initial seed idea and a checksum, which looks to me to give properly random results without the cost of NEWID():

SELECT TOP [number] 
FROM table_name
ORDER BY RAND(CHECKSUM(*) * RAND())
查看更多
泪湿衣
3楼-- · 2019-01-01 02:16

newid()/order by will work, but will be very expensive for large result sets because it has to generate an id for every row, and then sort them.

TABLESAMPLE() is good from a performance standpoint, but you will get clumping of results (all rows on a page will be returned).

For a better performing true random sample, the best way is to filter out rows randomly. I found the following code sample in the SQL Server Books Online article Limiting Results Sets by Using TABLESAMPLE:

If you really want a random sample of individual rows, modify your query to filter out rows randomly, instead of using TABLESAMPLE. For example, the following query uses the NEWID function to return approximately one percent of the rows of the Sales.SalesOrderDetail table:

SELECT * FROM Sales.SalesOrderDetail
WHERE 0.01 >= CAST(CHECKSUM(NEWID(),SalesOrderID) & 0x7fffffff AS float)
              / CAST (0x7fffffff AS int)

The SalesOrderID column is included in the CHECKSUM expression so that NEWID() evaluates once per row to achieve sampling on a per-row basis. The expression CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float / CAST (0x7fffffff AS int) evaluates to a random float value between 0 and 1.

When run against a table with 1,000,000 rows, here are my results:

SET STATISTICS TIME ON
SET STATISTICS IO ON

/* newid()
   rows returned: 10000
   logical reads: 3359
   CPU time: 3312 ms
   elapsed time = 3359 ms
*/
SELECT TOP 1 PERCENT Number
FROM Numbers
ORDER BY newid()

/* TABLESAMPLE
   rows returned: 9269 (varies)
   logical reads: 32
   CPU time: 0 ms
   elapsed time: 5 ms
*/
SELECT Number
FROM Numbers
TABLESAMPLE (1 PERCENT)

/* Filter
   rows returned: 9994 (varies)
   logical reads: 3359
   CPU time: 641 ms
   elapsed time: 627 ms
*/    
SELECT Number
FROM Numbers
WHERE 0.01 >= CAST(CHECKSUM(NEWID(), Number) & 0x7fffffff AS float) 
              / CAST (0x7fffffff AS int)

SET STATISTICS IO OFF
SET STATISTICS TIME OFF

If you can get away with using TABLESAMPLE, it will give you the best performance. Otherwise use the newid()/filter method. newid()/order by should be last resort if you have a large result set.

查看更多
孤独总比滥情好
4楼-- · 2019-01-01 02:17
select top 10 percent * from [yourtable] order by newid()

In response to the "pure trash" comment concerning large tables: you could do it like this to improve performance.

select  * from [yourtable] where [yourPk] in 
(select top 10 percent [yourPk] from [yourtable] order by newid())

The cost of this will be the key scan of values plus the join cost, which on a large table with a small percentage selection should be reasonable.

查看更多
泛滥B
5楼-- · 2019-01-01 02:17

Depending on your needs, TABLESAMPLE will get you nearly as random and better performance. this is available on MS SQL server 2005 and later.

TABLESAMPLE will return data from random pages instead of random rows and therefore deos not even retrieve data that it will not return.

On a very large table I tested

select top 1 percent * from [tablename] order by newid()

took more than 20 minutes.

select * from [tablename] tablesample(1 percent)

took 2 minutes.

Performance will also improve on smaller samples in TABLESAMPLE whereas it will not with newid().

Please keep in mind that this is not as random as the newid() method but will give you a decent sampling.

See the MSDN page.

查看更多
听够珍惜
6楼-- · 2019-01-01 02:18

This link have a interesting comparison between Orderby(NEWID()) and other methods for tables with 1, 7, and 13 millions of rows.

Often, when questions about how to select random rows are asked in discussion groups, the NEWID query is proposed; it is simple and works very well for small tables.

SELECT TOP 10 PERCENT *
  FROM Table1
  ORDER BY NEWID()

However, the NEWID query has a big drawback when you use it for large tables. The ORDER BY clause causes all of the rows in the table to be copied into the tempdb database, where they are sorted. This causes two problems:

  1. The sorting operation usually has a high cost associated with it. Sorting can use a lot of disk I/O and can run for a long time.
  2. In the worst-case scenario, tempdb can run out of space. In the best-case scenario, tempdb can take up a large amount of disk space that never will be reclaimed without a manual shrink command.

What you need is a way to select rows randomly that will not use tempdb and will not get much slower as the table gets larger. Here is a new idea on how to do that:

SELECT * FROM Table1
  WHERE (ABS(CAST(
  (BINARY_CHECKSUM(*) *
  RAND()) as int)) % 100) < 10

The basic idea behind this query is that we want to generate a random number between 0 and 99 for each row in the table, and then choose all of those rows whose random number is less than the value of the specified percent. In this example, we want approximately 10 percent of the rows selected randomly; therefore, we choose all of the rows whose random number is less than 10.

Please read the full article in MSDN.

查看更多
琉璃瓶的回忆
7楼-- · 2019-01-01 02:19

Didn't quite see this variation in the answers yet. I had an additional constraint where I needed, given an initial seed, to select the same set of rows each time.

For MS SQL:

Minimum example:

select top 10 percent *
from table_name
order by rand(checksum(*))

Normalized execution time: 1.00

NewId() example:

select top 10 percent *
from table_name
order by newid()

Normalized execution time: 1.02

NewId() is insignificantly slower than rand(checksum(*)), so you may not want to use it against large record sets.

Selection with Initial Seed:

declare @seed int
set @seed = Year(getdate()) * month(getdate()) /* any other initial seed here */

select top 10 percent *
from table_name
order by rand(checksum(*) % @seed) /* any other math function here */

If you need to select the same set given a seed, this seems to work.

查看更多
登录 后发表回答