Grouping data into fuzzy gaps and islands

2019-02-28 23:46发布

问题:

This is essentially a gaps and islands problem however it's atypical. I did cut the example down to bare minimum. I need to identify gaps that exceed a certain threshold and duplicates can't be a problem although this example removes them.
In any case the common solution of using ROW_NUMBER() doesn't help since gaps of even 1 can't be handled and the gap value is a parameter in 'real life'.

The code below actually works correctly. And it's super fast! But if you look at it you'll see why people are rather gun shy about relying upon it. The method was first published 9 years ago here http://www.sqlservercentral.com/articles/T-SQL/68467/ and I've read all 32 pages of comments. Nobody has successfully poked holes in it other than to say "it's not documented behavior". I've tried it on every version from 2005 to 2019 and it works.

The question is, beyond using a cursor or while loop to look at many millions of rows 1 by 1 - which takes, I don't know how long because I cancel after 30 min. - is there a 'supported' way to get the same results in a reasonable amount of time? Even 100x slower would complete 4M rows in 10 minutes and I can't find a way to come close to that!

CREATE TABLE #t (CreateDate   date not null
                ,TufpID       int not null
                ,Cnt          int not null
                ,FuzzyGroup   int null);
ALTER TABLE #t ADD CONSTRAINT PK_temp PRIMARY KEY CLUSTERED (CreateDate,TufpID);

-- Takes 40 seconds to write 4.4M rows from a source of 70M rows.
INSERT INTO #T
    SELECT X.CreateDate
          ,X.TufpID
          ,Cnt          = COUNT(*)
          ,FuzzyGroup   = null
      FROM SessionState SS
     CROSS APPLY(VALUES (CAST(SS.CreateDate as date),SS.TestUser_Form_Part_id)) X(CreateDate,TufpID)
     GROUP BY X.CreateDate
             ,X.TufpID
 ORDER BY x.CreateDate,x.TufpID;

-- Takes 6 seconds to update 4.4M rows.  They WILL update in clustered index order!
-- (Provided all the rules are followed - see the link above)
DECLARE @FuzzFactor int = 38 
DECLARE @Prior      int = -@FuzzFactor; -- Insure 1st row has it's own group
DECLARE @Group      int;
DECLARE @CDate      date;
UPDATE #T
   SET @Group = FuzzyGroup  = CASE WHEN t.TufpID - @PRIOR < @FuzzFactor AND t.CreateDate = @CDate
                                   THEN @Group ELSE t.TufpID END
      ,@CDate               = CASE WHEN @CDate = t.CreateDate THEN @CDate ELSE t.CreateDate END
      ,@Prior               = CASE WHEN @Prior = t.TufpID-1   THEN @Prior + 1 ELSE t.TufpID END
  FROM #t t WITH (TABLOCKX) OPTION(MAXDOP 1);

After the above executes the FuzzyGroup column contains the lowest value of TufpID in the group. IOW the first row (in clustered index order) contains the value of it's own TufpID column. Thereafter every row gets the same value until the date changes or a gap size (in this case 38) is exceeded. In those cases the current TufpID becomes the value put into FuzzyGroup until another change is detected. So after 6 seconds I can run queries that group by FuzzyGroup and analyze the islands.

In practice I do some running counts and totals as well in the same pass and so it takes 8 seconds not 6 but I could do those things with window functions pretty easily if I need to so I left them off.

This is the smallest table and I'll eventually need to handle 100M rows. Thus 10 minutes for 4.4M is probably not good enough but it's a place to start.

回答1:

This should be reasonably efficient and avoid relying on undocumented behaviour

WITH T1
     AS (SELECT *,
                PrevTufpID = LAG(TufpID)
                               OVER (PARTITION BY CreateDate
                                         ORDER BY TufpID)
         FROM   #T),
     T2
     AS (SELECT *,
                _FuzzyGroup = MAX(CASE
                                    WHEN PrevTufpID IS NULL
                                          OR TufpID - PrevTufpID >= @FuzzFactor
                                      THEN TufpID
                                  END)
                                OVER (PARTITION BY CreateDate
                                          ORDER BY TufpID ROWS UNBOUNDED PRECEDING)
         FROM   T1)
UPDATE T2
SET    FuzzyGroup = _FuzzyGroup 

The execution plan has a single ordered scan through the clustered index, with the row values then flowing through some window function operators and into the update.