Efficiently select random rows from large resultse

2019-06-25 12:52发布

问题:

I want to select a handful of random rows from the results of a complex query on a very large table (many millions of rows).

I am using SQL Server 2008, and the proper way to do this efficiently seems to be the TABLESAMPLE clause.

Note 1: I am not interested in the popular "order by NEWID()" solution - it is inefficient for large tables.

Note 2: Since my query is complex, I do not want to have to first calculate the COUNT over it, if possible.

Note 3: Since the resultset is huge, I do not want to have to traverse it myself, such as is suggested here.

The kicker is that I am using LINQ. Specifically, LINQ-To-Entities.

Is there a LINQ-friendly way to use TABLESAMPLE?

Even if there is no direct support, is there some way I can write most of my query in LINQ and then do a small amount of manual SQL to perform the TABLESAMPLE?

回答1:

Not a direct answer to your question but you can use this technique to select a random percentage sample of individual rows. The following query uses the NEWID function to return approximately one percent of the rows of the Sales.SalesOrderDetail table:

SELECT * FROM Sales.SalesOrderDetail   
WHERE 0.01 >= CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float) / CAST (0x7fffffff AS int)

Possibly of interest: T-SQL: Generating Random Numbers, Random Sampling and Random ‘Goodness’



回答2:

Something like this should work (syntax may not be exactly right but you should get the idea):

var rowCount = context.MyTable.Count();

int randomInt = new Random().Next(rowCount);    
var query = context.MyTable.Skip(randomInt).FirstOrDefault();


回答3:

With EF, you could create a defining query in your model using the tablesample construct. See http://msdn.microsoft.com/en-us/library/cc982038.aspx. Alternatively, you could create a randomized view in your database and then include that view in your model.



回答4:

It seems that what I want to accomplish is not even possible in the first place.

TABLESAMPLE cannot be used on derived tables, so it is not even feasible to have a complex query generating a large result set and then get a random sampling with TABLESAMPLE.

TABLESAMPLE is only something that can be used on the base tables that go into a query, before joins and soforth. (see documentation)

This MSDN link describes a way to get a random percentage of results efficiently, so the best way to do approximately what I want may be to use that in a view, and build my LINQ off of that view.

Thank you all for the input.