Using NEWID() with CTE to produce random subset of

I'm writing some SQL in a stored procedure to reduce a dataset to a limited random number of rows that I want to report on.

The report starts with a Group of Users and a filter is applied to specify the total number of random rows required (@SampleLimit).

To achieve the desired result, I start by creating a CTE (temp table) with:

The top(@SampleLimit) applied
group by UserId (as the UserID appears multiple times)
order by NEWID() to put the results in a random order

SQL:

; with cte_temp as 
       (select top(@SampleLimit) UserId from QueryResults 
        where (GroupId = @GroupId)
        group by UserId order by NEWID())

Once I have this result set, I then delete any results where the UserId is NOT IN the CTE created in the previous step.

delete QueryResults 
where (GroupId = @GroupId) and (UserId not in(select UserId from cte_temp))

The issue that I'm having is that from time to time, I get more results than specified in the @SampleLimit and other times it works exactly as expected.

I've tried breaking up the SQL and executing it outside the application and I cannot reproduce the issue.

Is there anything fundamentally wrong with what I am doing that could explain why I occasionally get more results that I request?

For completeness - my re-factored solution based on below answer:

select top(@SampleLimit) UserId into #T1
from  QueryResults
where (GroupId = @GroupId)
group by UserId
order by NEWID() 

delete QueryResults 
where (GroupId = @GroupId) and (UserId not in(select UserId from #T1))

标签： sql-server stored-procedures sql-server-2008-r2 newid

1条回答

等我变得足够好

2楼-- · 2020-03-29 08:34

It is undeterministic how many times the SELECT statement involving NEWID() will be executed.

If you get a nested loops anti semi join between QueryResults and cte_temp and there is no spool in the plan it will likely be re-evaluated as many times as there are rows in QueryResults this means that for each outer row the set that is being compared against with NOT IN may be entirely different.

Instead of using a CTE you can materialize the results into a temporary table to avoid this.

INSERT INTO #T
SELECT TOP(@SampleLimit) UserId
FROM   QueryResults
WHERE  ( GroupId = @GroupId )
GROUP  BY UserId
ORDER  BY NEWID()

Then reference that in the DELETE

0人赞添加讨论(0) 举报

Using NEWID() with CTE to produce random subset of

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间