Using NEWID() with CTE to produce random subset of

2020-03-29 07:41发布

I'm writing some SQL in a stored procedure to reduce a dataset to a limited random number of rows that I want to report on.

The report starts with a Group of Users and a filter is applied to specify the total number of random rows required (@SampleLimit).

To achieve the desired result, I start by creating a CTE (temp table) with:

  • The top(@SampleLimit) applied
  • group by UserId (as the UserID appears multiple times)
  • order by NEWID() to put the results in a random order

SQL:

; with cte_temp as 
       (select top(@SampleLimit) UserId from QueryResults 
        where (GroupId = @GroupId)
        group by UserId order by NEWID()) 

Once I have this result set, I then delete any results where the UserId is NOT IN the CTE created in the previous step.

delete QueryResults 
where (GroupId = @GroupId) and (UserId not in(select UserId from cte_temp))

The issue that I'm having is that from time to time, I get more results than specified in the @SampleLimit and other times it works exactly as expected.

I've tried breaking up the SQL and executing it outside the application and I cannot reproduce the issue.

Is there anything fundamentally wrong with what I am doing that could explain why I occasionally get more results that I request?

For completeness - my re-factored solution based on below answer:

select top(@SampleLimit) UserId into #T1
from  QueryResults
where (GroupId = @GroupId)
group by UserId
order by NEWID() 

delete QueryResults 
where (GroupId = @GroupId) and (UserId not in(select UserId from #T1))

1条回答
等我变得足够好
2楼-- · 2020-03-29 08:34

It is undeterministic how many times the SELECT statement involving NEWID() will be executed.

If you get a nested loops anti semi join between QueryResults and cte_temp and there is no spool in the plan it will likely be re-evaluated as many times as there are rows in QueryResults this means that for each outer row the set that is being compared against with NOT IN may be entirely different.

Instead of using a CTE you can materialize the results into a temporary table to avoid this.

INSERT INTO #T
SELECT TOP(@SampleLimit) UserId
FROM   QueryResults
WHERE  ( GroupId = @GroupId )
GROUP  BY UserId
ORDER  BY NEWID() 

Then reference that in the DELETE

查看更多
登录 后发表回答