Partitioning results in a running totals query

2020-02-11 09:38发布

问题:

I'm looking for a fast way to create cumulative totals in a large SQL Server 2008 data set that partition by a particular column, potentially by using a multiple assignment variable solution. As a very basic example, I'd like to create the "cumulative_total" column below:

user_id | month | total | cumulative_total

1       | 1     | 2.0   | 2.0
1       | 2     | 1.0   | 3.0
1       | 3     | 3.5   | 8.5

2       | 1     | 0.5   | 0.5
2       | 2     | 1.5   | 2.0
2       | 3     | 2.0   | 4.0

We have traditionally done this with correlated subqueries, but over large amounts of data (200,000+ rows and several different categories of running total) this isn't giving us ideal performance.

I recently read about using multiple assignment variables for cumulative summing here:

http://sqlblog.com/blogs/paul_nielsen/archive/2007/12/06/cumulative-totals-screencast.aspx

In the example in that blog the cumulative variable solution looks like this:

UPDATE my_table
SET @CumulativeTotal=cumulative_total=@CumulativeTotal+ISNULL(total, 0)

This solution seems brilliantly fast for summing for a single user in the above example (user 1 or user 2). However, I need to effectively partition by user - give me the cumulative total by user by month.

Does anyone know of a way of extending the multiple assignment variable concept to solve this, or any other ideas other than correlated subqueries or cursors?

Many thanks for any tips.

回答1:

Your options in SQL Server 2008 are reasonably limited - in that you can either do something based on the method as above (which is called a 'quirky update') or you can do something in the CLR.

Personally I would go with the CLR because it's guaranteed to work, while the quirky update syntax isn't something that's formally supported (so might break in future versions).

The variation on quirky update syntax you're looking for would be something like:

UPDATE my_table
SET @CumulativeTotal=cumulative_total=ISNULL(total, 0) + 
        CASE WHEN @user=@lastUser THEN @CumulativeTotal ELSE 0 END, 
    @user=lastUser

It's worth noting that in SQL Server 2012 introduces RANGE support to windowing functions, and so this is expressible in a way that is the most efficient, while being 100% supported.



回答2:

If you don't need to STORE the data (which you shouldn't, because you need to update the running totals any time any row is changed, added or deleted), and if you don't trust the quirky update (which you shouldn't, because it isn't guaranteed to work and its behavior could change with a hotfix, service pack, upgrade, or even an underlying index or statistics change), you can try this type of query at runtime. This is a method fellow MVP Hugo Kornelis coined "set-based iteration" (he posted something similar in one of his chapters of SQL Server MVP Deep Dives). Since running totals typically requires a cursor over the entire set, a quirky update over the entire set, or a single non-linear self-join that becomes more and more expensive as the row counts increase, the trick here is to loop through some finite element in the set (in this case, the "rank" of each row in terms of month, for each user - and you process only each rank once for all user/month combinations at that rank, so instead of looping through 200,000 rows, you loop up to 24 times).

DECLARE @t TABLE
(
  [user_id] INT, 
  [month] TINYINT,
  total DECIMAL(10,1), 
  RunningTotal DECIMAL(10,1), 
  Rnk INT
);

INSERT @t SELECT [user_id], [month], total, total, 
  RANK() OVER (PARTITION BY [user_id] ORDER BY [month]) 
  FROM dbo.my_table;

DECLARE @rnk INT = 1, @rc INT = 1;

WHILE @rc > 0
BEGIN
  SET @rnk += 1;

  UPDATE c SET RunningTotal = p.RunningTotal + c.total
    FROM @t AS c INNER JOIN @t AS p
    ON c.[user_id] = p.[user_id]
    AND p.rnk = @rnk - 1
    AND c.rnk = @rnk;

  SET @rc = @@ROWCOUNT;
END

SELECT [user_id], [month], total, RunningTotal
FROM @t
ORDER BY [user_id], rnk;

Results:

user_id  month   total   RunningTotal
-------  -----   -----   ------------
1        1       2.0     2.0
1        2       1.0     3.0
1        3       3.5     6.5 -- I think your calculation is off
2        1       0.5     0.5
2        2       1.5     2.0
2        3       2.0     4.0

Of course you can update the base table from this table variable, but why bother, since those stored values are only good until the next time the table is touched by any DML statement?

UPDATE mt
  SET cumulative_total = t.RunningTotal
  FROM dbo.my_table AS mt
  INNER JOIN @t AS t
  ON mt.[user_id] = t.[user_id]
  AND mt.[month] = t.[month];

Since we're not relying on implicit ordering of any kind, this is 100% supported and deserves a performance comparison relative to the unsupported quirky update. Even if it doesn't beat it but comes close, you should consider using it anyway IMHO.

As for the SQL Server 2012 solution, Matt mentions RANGE but since this method uses an on-disk spool you should also test with ROWS instead of just running with RANGE. Here is a quick example for your case:

SELECT
  [user_id],
  [month],
  total,
  RunningTotal = SUM(total) OVER 
  (
    PARTITION BY [user_id] 
    ORDER BY [month] ROWS UNBOUNDED PRECEDING
  )
FROM dbo.my_table
ORDER BY [user_id], [month];

Compare this with RANGE UNBOUNDED PRECEDING or no ROWS\RANGE at all (which will also use the RANGE on-disk spool). The above will have lower overall duration and way less I/O, even though the plan looks slightly more complex (an additional sequence project operator).

I've recently published a blog post outlining some performance differences I observed for a specific running totals scenario:

http://www.sqlperformance.com/2012/07/t-sql-queries/running-totals