I would like to calculate the rolling median for a column in Greenplum, i.e. as below:
| x | rolling_median_x |
| -- + ---------------- |
| 4 | 4 |
| 1 | 2.5 |
| 3 | 3 |
| 2 | 2.5 |
| 1 | 2 |
| 6 | 2.5 |
| 9 | 3 |
x
is an integer and for each row rolling_median_x
shows the median of x
for the current and preceding rows. E.g. for the third row rolling_median_x = median(4, 1, 3) = 3
.
Things I've found out so far:
- the
median
function can't be used as a framed window function, i.e. median(x) OVER(RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
- the same is true for many other function such as
percent_rank
or nth_value
- recursive self join is not supported in this version of Greenplum
As a matter of fact I was unable to find proper documentation on which functions can be used as framed window function in Greenplum...
I'm using Greenplum 4.3.4.0 (which is based on Postgres 8.2.15) and updating is not an option unfortunately.
One remark - a citate from Wikipedia: ORDER BY
ORDER BY is the only way to sort the rows in the result set. Without
this clause, the relational database system may return the rows in any
order. If an ordering is required, the ORDER BY must be provided in
the SELECT statement sent by the application. Although some database
systems allow the specification of an ORDER BY clause in subselects or
view definitions, the presence there has no effect. A view is a
logical relational table, and the relational model mandates that a
table is a set of rows, implying no sort order whatsoever.
Since you need to calculate the median for the current and preceding rows, you must have in the table an additional row that defines the order of rows and can be used to determine which rows precede given row and which ones come after.
Let say some id
column like this:
| id | x | rolling_median_x |
|----|---|------------------|
| 1 | 4 | 4 |
| 2 | 1 | 2.5 |
| 3 | 3 | 3 |
| 4 | 2 | 2.5 |
| 5 | 1 | 2 |
| 6 | 6 | 2.5 |
| 7 | 9 | 3 |
If you cannot use analytic functions, then try pure SQL.
This article shows various methods of computing the Median with SQL.
I think the Henderson’s Median would be best for our needs:
SELECT CASE COUNT(*) % 2
WHEN 0 -- even sized table
THEN (P1.part_wgt + MIN(CASE WHEN P2.part_wgt > P1.part_wgt
THEN P2.part_wgt
ELSE NULL END))/2.0
ELSE P1.part_wgt --odd sized table
END AS median
FROM Parts AS P1, Parts AS P2
GROUP BY P1.part_wgt
HAVING COUNT(CASE WHEN P1.part_wgt >= P2.part_wgt
THEN 1
ELSE NULL END)
= (COUNT(*) + 1) / 2;
Just run this query for each row as a dependent subquery, a general idea is like this:
SELECT t.*, (
SELECT .... Henderson's query FROM table x
WHERE x.id <= t.id
......
) As our_median
FROM table t
You can find an example implementation in this demo
SELECT t.*, (
SELECT CASE COUNT(*) % 2
WHEN 0 -- even sized table
THEN (P1.x + MIN(CASE WHEN P2.x > P1.x
THEN P2.x
ELSE NULL END))/2.0
ELSE P1.x --odd sized table
END AS median
FROM Table333 AS P1, Table333 AS P2
WHERE p1.id <= t.id AND p2.id <= t.id
GROUP BY P1.x
HAVING COUNT(CASE WHEN P1.x >= P2.x
THEN 1
ELSE NULL END)
= (COUNT(*) + 1) / 2
) as Our_median
FROM Table333 t;
| id | x | rolling_median_x | our_median |
|----|---|------------------|------------|
| 1 | 4 | 4 | 4 |
| 2 | 1 | 2.5 | 2.5 |
| 3 | 3 | 3 | 3 |
| 4 | 2 | 2.5 | 2.5 |
| 5 | 1 | 2 | 2 |
| 6 | 6 | 2.5 | 2.5 |
| 7 | 9 | 3 | 3 |
This query will probably be slow - this is a price you must pay for having ancient version of PostgreSQL
I'm using psql 8.2.15 and updating is not an option unfortunately.
Ouch.
If it was the rolling mean, things would be simple, but the rolling median will be very slow due to the need for sorting. The way to avoid this is to insert the values into a heap or a btree as they come which allows to get the rolling median without sorting on each new value. But this needs custom code.
I would use plpython to implement this:
Rolling median algorithm in C