-->

Selecting a subset of rows that exceed a percentag

2019-04-07 12:58发布

问题:

I have a table with customers, users and revenue similar to below (in reality thousands of records):

Customer   User    Revenue
001        James   500
002        James   750
003        James   450
004        Sarah   100
005        Sarah   500
006        Sarah   150
007        Sarah   600
008        James   150
009        James   100

What I want to do is to return only the highest spending customers that make up 80% of the total revenue for the user.

To do this manually I would order James' customers by their revenue, work out the percentage of total and a running total percentage, then only return records up to the point that the running total hits 80%:

Customer    User    Revenue     % of total  Running Total %
002         James   750         0.38        0.38 
001         James   500         0.26        0.64 
003         James   450         0.23        0.87  <- Greater than 80%, last record
008         James   150         0.08        0.95 
009         James   100         0.05        1.00 

I've tried using a CTE but so far have come up blank. Is there any way to do this via a single query rather than manually in an Excel sheet?

回答1:

SQL Server 2012+ only

You could use windowed SUM:

WITH cte AS
(
   SELECT *,
          1.0 * Revenue/SUM(Revenue) OVER(PARTITION BY [User]) AS percentile,
          1.0 * SUM(Revenue) OVER(PARTITION BY [User] ORDER BY [Revenue] DESC)
                /SUM(Revenue) OVER(PARTITION BY [User]) AS running_percentile
   FROM tab
)
SELECT *
FROM cte 
WHERE running_percentile <= 0.8;

LiveDemo


SQL Server 2008:

WITH cte AS
(
    SELECT *, ROW_NUMBER() OVER(PARTITION BY [User] ORDER BY Revenue DESC) AS rn
    FROM t    
), cte2 AS
(
    SELECT c.Customer, c.[User], c.[Revenue]
           ,percentile         = 1.0 * Revenue / NULLIF(c3.s,0)
           ,running_percentile = 1.0 * c2.s    / NULLIF(c3.s,0)
    FROM cte c
    CROSS APPLY
         (SELECT SUM(Revenue) AS s
          FROM cte c2
          WHERE c.[User] = c2.[User]
            AND c2.rn <= c.rn) c2
    CROSS APPLY
         (SELECT SUM(Revenue) AS s
          FROM cte c2
          WHERE c.[User] = c2.[User]) AS c3
) 
SELECT *
FROM cte2
WHERE running_percentile <= 0.8;

LiveDemo2

Output:

╔══════════╦═══════╦═════════╦════════════════╦════════════════════╗
║ Customer ║ User  ║ Revenue ║   percentile   ║ running_percentile ║
╠══════════╬═══════╬═════════╬════════════════╬════════════════════╣
║        2 ║ James ║     750 ║ 0,384615384615 ║ 0,384615384615     ║
║        1 ║ James ║     500 ║ 0,256410256410 ║ 0,641025641025     ║
║        7 ║ Sarah ║     600 ║ 0,444444444444 ║ 0,444444444444     ║
╚══════════╩═══════╩═════════╩════════════════╩════════════════════╝

EDIT 2:

That looks nearly there, the only niggle is it's missing the last row, the third row for James takes him over 0.80 but needs to be included.

WITH cte AS
(
    SELECT *, ROW_NUMBER() OVER(PARTITION BY [User] ORDER BY Revenue DESC) AS rn
    FROM t    
), cte2 AS
(
    SELECT c.Customer, c.[User], c.[Revenue]
           ,percentile         = 1.0 * Revenue / NULLIF(c3.s,0)
           ,running_percentile = 1.0 * c2.s    / NULLIF(c3.s,0)
    FROM cte c
    CROSS APPLY
         (SELECT SUM(Revenue) AS s
          FROM cte c2
          WHERE c.[User] = c2.[User]
            AND c2.rn <= c.rn) c2
    CROSS APPLY
         (SELECT SUM(Revenue) AS s
          FROM cte c2
          WHERE c.[User] = c2.[User]) AS c3
) 
SELECT a.*
FROM cte2 a
CROSS APPLY (SELECT MIN(running_percentile) AS rp
             FROM cte2
             WHERE running_percentile >= 0.8
               AND cte2.[User] = a.[User]) AS s
WHERE a.running_percentile <= s.rp;

LiveDemo3

Output:

╔══════════╦═══════╦═════════╦════════════════╦════════════════════╗
║ Customer ║ User  ║ Revenue ║   percentile   ║ running_percentile ║
╠══════════╬═══════╬═════════╬════════════════╬════════════════════╣
║        2 ║ James ║     750 ║ 0,384615384615 ║ 0,384615384615     ║
║        1 ║ James ║     500 ║ 0,256410256410 ║ 0,641025641025     ║
║        3 ║ James ║     450 ║ 0,230769230769 ║ 0,871794871794     ║
║        7 ║ Sarah ║     600 ║ 0,444444444444 ║ 0,444444444444     ║
║        5 ║ Sarah ║     500 ║ 0,370370370370 ║ 0,814814814814     ║
╚══════════╩═══════╩═════════╩════════════════╩════════════════════╝

Looks to be perfect, translated to my big table and returns what I need, spent a good 5 minutes working through it and still can't follow what you've done!

SQL Server 2008 does not support everything in OVER() clause, but ROW_NUMBER does.

First cte just calculate position within a group:

╔═══════════╦════════╦══════════╦════╗
║ Customer  ║ User   ║ Revenue  ║ rn ║
╠═══════════╬════════╬══════════╬════╣
║        2  ║ James  ║     750  ║  1 ║
║        1  ║ James  ║     500  ║  2 ║
║        3  ║ James  ║     450  ║  3 ║
║        8  ║ James  ║     150  ║  4 ║
║        9  ║ James  ║     100  ║  5 ║
║        7  ║ Sarah  ║     600  ║  1 ║
║        5  ║ Sarah  ║     500  ║  2 ║
║        6  ║ Sarah  ║     150  ║  3 ║
║        4  ║ Sarah  ║     100  ║  4 ║
╚═══════════╩════════╩══════════╩════╝

Second cte:

  • c2 subquery calculate running total based on rank from ROW_NUMBER
  • c3 calculate full sum per user

In final query s subquery finds the lowest running total that exceeds 80%.

EDIT 3:

Using ROW_NUMBER is actually redundant.

WITH cte AS
(
    SELECT c.Customer, c.[User], c.[Revenue]
           ,percentile         = 1.0 * Revenue / NULLIF(c3.s,0)
           ,running_percentile = 1.0 * c2.s    / NULLIF(c3.s,0)
    FROM t c
    CROSS APPLY
         (SELECT SUM(Revenue) AS s
          FROM t c2
          WHERE c.[User] = c2.[User]
            AND c2.Revenue >= c.Revenue) c2
    CROSS APPLY
         (SELECT SUM(Revenue) AS s
          FROM t c2
          WHERE c.[User] = c2.[User]) AS c3
) 
SELECT a.*
FROM cte a
CROSS APPLY (SELECT MIN(running_percentile) AS rp
             FROM cte c2
             WHERE running_percentile >= 0.8
               AND c2.[User] = a.[User]) AS s
WHERE a.running_percentile <= s.rp
ORDER BY [User], Revenue DESC;

LiveDemo4



回答2:

In SQL Server 2012+, you would use the cumulative sum -- much more efficient. In SQL Server 2008, you can do this using a correlated subquery or cross apply:

select t.*,
       sum(t.Revenue*1.0) / sum(t.Revenue) over (partition by user) as [% of Total],
       sum(RunningRevenue*1.0) / sum(t.Revenue) over (partition by user) as [Running Total %]
from t cross apply
     (select sum(Revenue) as RunningRevenue
      from t t2
      where t2.Revenue >= t.Revenue and t2.user = t.user
     ) t2;

Note: The *1.0 is just in case Revenue is stored as an integer. SQL Server does integer division, which would return 0 for both columns on almost all rows.

EDIT:

Add where user = 'James' if you want results only for James.