SQL ROW_NUMBER() over performance problem

2019-04-14 22:04发布

问题:

I have this SQL that works fine.

Want the my filter to return the LATEST unique SessionGuids with the highest UserSessionSequenceID.

Problem is performance sucks - even though I have good indexes. How can I rewrite this - to omit the ROW_NUMBER line?

SELECT TOP(@resultCount) * FROM 
(
    SELECT
        [UserSessionSequenceID]
        ,[SessionGuid]
        ,[IP]
        ,[Url]
        ,[UrlTitle]
        ,[SiteID]
        ,[BrowserWidth]
        ,[BrowserHeight]
        ,[Browser]
        ,[BrowserVersion]
        ,[Referer]
        ,[Timestamp]
        ,ROW_NUMBER() over (PARTITION BY [SessionGuid] 
                                    ORDER BY UserSessionSequenceID DESC) AS sort 
   FROM [tblSequence]
) AS t     
WHERE ([Timestamp] > DATEADD(mi, -@minutes, GETDATE())) 
  AND (SiteID = @siteID) 
  AND sort = 1
ORDER BY [UserSessionSequenceID] DESC

Thanks a lot :-)

回答1:

even though I have good indexes

No offense, but let us be the judge of that. Always post the exact schema for your tables, including all indexes and cardinalities, when asking SQL Server performance questions.

For example, lets consider the following table structure:

create table tblSequence (
 [UserSessionSequenceID] int not null
        ,[SessionGuid] uniqueidentifier not null
        ,[SiteID] int not null
        ,[Timestamp] datetime not null
        , filler varchar(512));
go

create clustered index cdxSequence on tblSequence (SiteID, [Timestamp]);
go

which is same as yours, but all fields not relevant to the performance problems are aggregated into a generic filler. Lets see, how bad is the performance on, say, 1M rows for about 50k sessions? Lets fill up the table with random data, but we'll simulate what ammounts to 'user activity':

set nocount on;
declare @i int = 0, @sc int = 1;
declare @SessionGuid uniqueidentifier = newid()
    , @siteID int = 1
    , @Timestamp datetime = dateadd(day, rand()*1000, '20070101')
    , @UserSessionSequenceID int = 0;
begin tran;
while @i<1000000
begin
    insert into tblSequence (
        [UserSessionSequenceID]
        ,[SessionGuid]
        ,[SiteID]
        ,[Timestamp]
        , filler)
    values (
        @UserSessionSequenceID
        , @SessionGuid
        , @siteID
        , @timestamp
        , replicate('X', rand()*512));

    if rand()*100 < 5
    begin
        set @SessionGuid = newid();
        set @siteID = rand() * 10;
        set @Timestamp = dateadd(day, rand()*1000, '20070101');
        set @UserSessionSequenceID = 0;
        set @sc += 1;
    end
    else
    begin
        set @timestamp = dateadd(second, rand()*300, @timestamp);
        set @UserSessionSequenceID += 1;
    end

    set @i += 1;
    if (@i % 1000) = 0
    begin
        raiserror(N'Inserted %i rows, %i sessions', 0, 1, @i, @sc);
        commit;
        begin tran;
    end
end
commit;

This takes about 1 minute to fill up. Now lets query the same query you asked: what is the last action of any user session on site X in the last Y minutes? I'll have to use a specific date for @now instead of GETDATE() becaus emy dtaa is simulated, not real, so I'm using whatever max timestamp was filled in randomly for SiteId 1:

set statistics time on;
set statistics io on;

declare @resultCount int = 30;
declare @minutes int = 60*24;
declare @siteID int = 1;
declare @now datetime = '2009-09-26 02:08:27.000';

SELECT TOP(@resultCount) * FROM  
( 
    SELECT 
        [UserSessionSequenceID] 
        ,[SessionGuid] 
        , SiteID
        , Filler
        ,[Timestamp] 
        ,ROW_NUMBER() over (PARTITION BY [SessionGuid]  
                                    ORDER BY UserSessionSequenceID DESC) AS sort  
   FROM [tblSequence] 
   where SiteID = @siteID
   and [Timestamp] > DATEADD(mi, -@minutes, @now)
) AS t      
WHERE sort = 1 
ORDER BY [UserSessionSequenceID] DESC ;

This is same query as yours, but the restrictive filters are moved inside the ROW_NUMBER() part subquery. The results come back in:

Table 'tblSequence'. Scan count 1, logical reads 12, physical reads 0.

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 31 ms.

31 ms response time on a warm cache, 12 pages read out of the nearly 60k pages of the table.

Updated

After reading again original query I realize my modified query is different. You only need new sessions. I still believe that the filtering out by SiteID and Timestmap is the only way to get the necessary performance, so the solution is to validate the candidate finds with a NOT EXISTS condition:

SELECT TOP(@resultCount) * FROM  
( 
    SELECT 
        [UserSessionSequenceID] 
        ,[SessionGuid] 
        , SiteID
        , Filler
        ,[Timestamp] 
        ,ROW_NUMBER() over (
            PARTITION BY [SessionGuid]  
            ORDER BY UserSessionSequenceID DESC) 
         AS sort  
   FROM [tblSequence] 
   where SiteID = @siteID
   and [Timestamp] > DATEADD(mi, -@minutes, @now)
) AS new
WHERE sort = 1 
and not exists (
    select SessionGuid 
    from tblSequence
    where SiteID = @siteID
    and SessionGuid = new.SessionGuid
    and [TimeStamp] < DATEADD(mi, -@minutes, @now)
)
ORDER BY [UserSessionSequenceID] DESC 

This returns on my laptop, for 1M rows over 400k sessions in 40 ms from a warm cache:

Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0
Table 'tblSequence'. Scan count 2, logical reads 709, physical reads 0

 SQL Server Execution Times:
   CPU time = 16 ms,  elapsed time = 40 ms.


回答2:

Try these - should be equivalent query but you'll have to compare the query plans:

Using a JOIN

  SELECT DISTINCT  TOP(@resultCount)
         s.usersessionsequenceid,
         s.sessionguid,
         s.ip,
         s.url,
         s.urltitle,
         s.siteid,
         s.browserwidth,
         s.browserheight,
         s.browser,
         s.browserversion,
         s.referer,
         s.timestamp
    FROM tblsequence s
    JOIN (SELECT t.sessionquid,
                 MAX(t.timestamp) AS max_ts
            FROM tblsequence t
        GROUP BY t.sessionguid) x ON x.sessionguid = s.sessionguid
                                 AND x.max_ts = s.timestamp
   WHERE s.siteid = @SiteID
     AND s.timestamp > DATEADD(mi, -@minutes, GETDATE())
ORDER BY s.usersessionsequenceid DESC

Using EXISTS

  SELECT TOP(@resultCount)
         s.usersessionsequenceid,
         s.sessionguid,
         s.ip,
         s.url,
         s.urltitle,
         s.siteid,
         s.browserwidth,
         s.browserheight,
         s.browser,
         s.browserversion,
         s.referer,
         s.timestamp
    FROM tblsequence s
   WHERE s.siteid = @SiteID
     AND s.timestamp > DATEADD(mi, -@minutes, GETDATE())
     AND EXISTS(SELECT NULL
                  FROM tblsequence t
                 WHERE t.sessionguid = s.sessionguid
              GROUP BY t.sessionguid
                HAVING MAX(t.timestamp) = s.timestamp
ORDER BY s.usersessionsequenceid DESC

But if you want to get where the value is 2 or more, you'll have to stick with your ROW_NUMBER query.