Background:
The original case was very simple. Calculate running total per user from highest revenue to lowest:
CREATE TABLE t(Customer INTEGER NOT NULL PRIMARY KEY
,"User" VARCHAR(5) NOT NULL
,Revenue INTEGER NOT NULL);
INSERT INTO t(Customer,"User",Revenue) VALUES
(001,'James',500),(002,'James',750),(003,'James',450),
(004,'Sarah',100),(005,'Sarah',500),(006,'Sarah',150),
(007,'Sarah',600),(008,'James',150),(009,'James',100);
Query:
SELECT *,
1.0 * Revenue/SUM(Revenue) OVER(PARTITION BY "User") AS percentage,
1.0 * SUM(Revenue) OVER(PARTITION BY "User" ORDER BY Revenue DESC)
/SUM(Revenue) OVER(PARTITION BY "User") AS running_percentage
FROM t;
Output:
╔════╦═══════╦═════════╦════════════╦════════════════════╗
║ ID ║ User ║ Revenue ║ percentage ║ running_percentage ║
╠════╬═══════╬═════════╬════════════╬════════════════════╣
║ 2 ║ James ║ 750 ║ 0.38 ║ 0.38 ║
║ 1 ║ James ║ 500 ║ 0.26 ║ 0.64 ║
║ 3 ║ James ║ 450 ║ 0.23 ║ 0.87 ║
║ 8 ║ James ║ 150 ║ 0.08 ║ 0.95 ║
║ 9 ║ James ║ 100 ║ 0.05 ║ 1 ║
║ 7 ║ Sarah ║ 600 ║ 0.44 ║ 0.44 ║
║ 5 ║ Sarah ║ 500 ║ 0.37 ║ 0.81 ║
║ 6 ║ Sarah ║ 150 ║ 0.11 ║ 0.93 ║
║ 4 ║ Sarah ║ 100 ║ 0.07 ║ 1 ║
╚════╩═══════╩═════════╩════════════╩════════════════════╝
It could be calculated differently using specific windowed functions.
Now let's assume that we cannot use windowed SUM
and rewrite it:
SELECT c.Customer, c."User", c."Revenue"
,1.0 * Revenue / NULLIF(c3.s,0) AS percentage
,1.0 * c2.s / NULLIF(c3.s,0) AS running_percentage
FROM t c
CROSS APPLY
(SELECT SUM(Revenue) AS s
FROM t c2
WHERE c."User" = c2."User"
AND c2.Revenue >= c.Revenue) AS c2
CROSS APPLY
(SELECT SUM(Revenue) AS s
FROM t c2
WHERE c."User" = c2."User") AS c3
ORDER BY "User", Revenue DESC;
I have used CROSS APPLY
because I do not like correlated subqueries in SELECT
colums list and c3
is used twice.
Everything work as it should. But when we look closer c2
and c3
are very similiar. So why not combine them and use simple conditional aggregation:
SELECT c.Customer, c."User", c."Revenue"
,1.0 * Revenue / NULLIF(c2.sum_total,0) AS percentage
,1.0 * c2.sum_running / NULLIF(c2.sum_total,0) AS running_percentage
FROM t c
CROSS APPLY
(SELECT SUM(Revenue) AS sum_total,
SUM(CASE WHEN c2.Revenue >= c.Revenue THEN Revenue ELSE 0 END)
AS sum_running
FROM t c2
WHERE c."User" = c2."User") AS c2
ORDER BY "User", Revenue DESC;
Unfortunately it is not possible.
Multiple columns are specified in an aggregated expression containing an outer reference. If an expression being aggregated contains an outer reference, then that outer reference must be the only column referenced in the expression.
Of course I could circumvent it wrapping with another subquery, but it becomes a bit "ugly":
SELECT c.Customer, c."User", c."Revenue"
,1.0 * Revenue / NULLIF(c2.sum_total,0) AS percentage
,1.0 * c2.sum_running / NULLIF(c2.sum_total,0) AS running_percentage
FROM t c
CROSS APPLY
( SELECT SUM(Revenue) AS sum_total,
SUM(running_revenue) AS sum_running
FROM (SELECT Revenue,
CASE WHEN c2.Revenue >= c.Revenue THEN Revenue ELSE 0 END
AS running_revenue
FROM t c2
WHERE c."User" = c2."User") AS sub
) AS c2
ORDER BY "User", Revenue DESC
Postgresql
version. The only difference is LATERAL
instead of CROSS APPLY
.
SELECT c.Customer, c."User", c.Revenue
,1.0 * Revenue / NULLIF(c2.sum_total,0) AS percentage
,1.0 * c2.running_sum / NULLIF(c2.sum_total,0) AS running_percentage
FROM t c
,LATERAL (SELECT SUM(Revenue) AS sum_total,
SUM(CASE WHEN c2.Revenue >= c.Revenue THEN c2.Revenue ELSE 0 END)
AS running_sum
FROM t c2
WHERE c."User" = c2."User") c2
ORDER BY "User", Revenue DESC;
It works very nice.
SQLite
/MySQL
version (that is why I preferLATERAL/CROSS APPLY
):
SELECT c.Customer, c."User", c.Revenue,
1.0 * Revenue / (SELECT SUM(Revenue)
FROM t c2
WHERE c."User" = c2."User") AS percentage,
1.0 * (SELECT SUM(CASE WHEN c2.Revenue >= c.Revenue THEN c2.Revenue ELSE 0 END)
FROM t c2
WHERE c."User" = c2."User") /
(SELECT SUM(c2.Revenue)
FROM t c2
WHERE c."User" = c2."User") AS running_percentage
FROM t c
ORDER BY "User", Revenue DESC;
SQLFiddleDemo-SQLite
SQLFiddleDemo-MySQL
I've read Aggregates with an Outer Reference:
The source for the restriction is in the
SQL-92
standard, andSQL Server
inherited it from theSybase
codebase. The problem is that SQL Server needs to figure out which query will compute the aggregate.
I do not search for answers that only show how to circumvent it.
The questions are:
- Which part of standard disallow or interfere with it?
- Why other RDBMSes do not have problem with this kind of outer dependency?
- Do they extend
SQL Standard
andSQL Server
behaves as it should orSQL Server
does not implement it fully(correctly?)?.
I would be very grateful for references to:
ISO standard
(92 or newer)- SQL Server Standards Support
- official documenation from any RDBMS that explains it (
SQL Server/Postgresql/Oracle/...
).
EDIT:
I know that SQL-92
does not have concept of LATERAL
. But version with subqueries (like in SQLite/MySQL
) does not work too.
EDIT 2:
To simplify it a bit, let's check only correlated subquery only:
SELECT c.Customer, c."User", c.Revenue,
1.0*(SELECT SUM(CASE WHEN c2.Revenue >= c.Revenue THEN c2.Revenue ELSE 0 END)
FROM t c2
WHERE c."User" = c2."User")
/ (SELECT SUM(c2.Revenue)
FROM t c2
WHERE c."User" = c2."User") AS running_percentage
FROM t c
ORDER BY "User", Revenue DESC;
The version above works fine in MySQL/SQLite/Postgresql
.
In SQL Server
we get error. After wraping it with subquery to "flatten" it to one level it works:
SELECT c.Customer, c."User", c.Revenue,
1.0 * (
SELECT SUM(CASE WHEN r1 >= r2 THEN r1 ELSE 0 END)
FROM (SELECT c2.Revenue AS r1, c.Revenue r2
FROM t c2
WHERE c."User" = c2."User") AS S) /
(SELECT SUM(c2.Revenue)
FROM t c2
WHERE c."User" = c2."User") AS running_percentage
FROM t c
ORDER BY "User", Revenue DESC;
The point of this question is how does SQL standard
regulate it.
There is an easier solution:
I am not sure why or if this limitation is in the SQL '92 standard. I did have it pretty well memorized 20 or so years ago, but I don't recall that particular limitation.
I should note:
There is no such limitation in the SQL standard for
LATERAL
.CROSS APPLY
is a vendor-specific extension from Microsoft (Oracle adopted it later for compatibility) and its limitations are obviously not owed to the ANSI SQL standard, since the MS feature pre-dates the standard.LATERAL
according to ANSI SQL is basically just a modifier for joins to allow lateral references in the join tree. There is no limit to the number of columns that can be referenced.I wouldn't see a reason for the odd restriction to begin with. Maybe it's because
CROSS APPLY
was originally intended to allow table-valued functions, which was later extended to allow sub-SELECT
s.The Postgres manual explains
LATERAL
like this:The Postgres version of your query (without the more elegant window functions) can be simpler:
Postgres 9.4+ has the more elegant aggregate
FILTER
for conditional aggregates.That was backwards, I thought aboutNULLIF
was redundant.revenue
is definedNOT NULL
, the aggregates are guaranteed to find 1 or more rows, and theLATERAL
sub-SELECT
is joined in aCROSS JOIN
, sosum_total
cannot beNULL
.COALESCE
.NULLIF
makes sense, I only suggest a minor simplification.Cast
sum_total
tonumeric
once.Round result to match your desired result.