I'm writing some SQL in a stored procedure to reduce a dataset to a limited random number of rows that I want to report on.
The report starts with a Group
of Users
and a filter is applied to specify the total number of random rows required (@SampleLimit
).
To achieve the desired result, I start by creating a CTE (temp table) with:
- The
top(@SampleLimit)
applied group by UserId
(as the UserID appears multiple times)order by NEWID()
to put the results in a random order
SQL:
; with cte_temp as
(select top(@SampleLimit) UserId from QueryResults
where (GroupId = @GroupId)
group by UserId order by NEWID())
Once I have this result set, I then delete any results where the UserId is NOT IN
the CTE created in the previous step.
delete QueryResults
where (GroupId = @GroupId) and (UserId not in(select UserId from cte_temp))
The issue that I'm having is that from time to time, I get more results than specified in the @SampleLimit
and other times it works exactly as expected.
I've tried breaking up the SQL and executing it outside the application and I cannot reproduce the issue.
Is there anything fundamentally wrong with what I am doing that could explain why I occasionally get more results that I request?
For completeness - my re-factored solution based on below answer:
select top(@SampleLimit) UserId into #T1
from QueryResults
where (GroupId = @GroupId)
group by UserId
order by NEWID()
delete QueryResults
where (GroupId = @GroupId) and (UserId not in(select UserId from #T1))
It is undeterministic how many times the
SELECT
statement involvingNEWID()
will be executed.If you get a nested loops anti semi join between
QueryResults
andcte_temp
and there is no spool in the plan it will likely be re-evaluated as many times as there are rows inQueryResults
this means that for each outer row the set that is being compared against withNOT IN
may be entirely different.Instead of using a CTE you can materialize the results into a temporary table to avoid this.
Then reference that in the
DELETE