I would like to calculate the rolling median for a column in Greenplum, i.e. as below:
| x | rolling_median_x |
| -- + ---------------- |
| 4 | 4 |
| 1 | 2.5 |
| 3 | 3 |
| 2 | 2.5 |
| 1 | 2 |
| 6 | 2.5 |
| 9 | 3 |
x
is an integer and for each row rolling_median_x
shows the median of x
for the current and preceding rows. E.g. for the third row rolling_median_x = median(4, 1, 3) = 3
.
Things I've found out so far:
- the
median
function can't be used as a framed window function, i.e.median(x) OVER(RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
- the same is true for many other function such as
percent_rank
ornth_value
- recursive self join is not supported in this version of Greenplum
As a matter of fact I was unable to find proper documentation on which functions can be used as framed window function in Greenplum...
I'm using Greenplum 4.3.4.0 (which is based on Postgres 8.2.15) and updating is not an option unfortunately.
Ouch.
If it was the rolling mean, things would be simple, but the rolling median will be very slow due to the need for sorting. The way to avoid this is to insert the values into a heap or a btree as they come which allows to get the rolling median without sorting on each new value. But this needs custom code.
I would use plpython to implement this:
Rolling median algorithm in C
One remark - a citate from Wikipedia: ORDER BY
Since you need to calculate the median for the current and preceding rows, you must have in the table an additional row that defines the order of rows and can be used to determine which rows precede given row and which ones come after.
Let say some
id
column like this:If you cannot use analytic functions, then try pure SQL.
This article shows various methods of computing the Median with SQL.
I think the Henderson’s Median would be best for our needs:
Just run this query for each row as a dependent subquery, a general idea is like this:
You can find an example implementation in this demo
This query will probably be slow - this is a price you must pay for having ancient version of PostgreSQL