I'm working on redshift - I have a table like
userid oid version number_of_objects
1 ab 1 10
1 ab 2 20
1 ab 3 17
1 ab 4 16
1 ab 5 14
1 cd 1 5
1 cd 2 6
1 cd 3 9
1 cd 4 12
2 ef 1 4
2 ef 2 3
2 gh 1 16
2 gh 2 12
2 gh 3 21
I would like to select from this table the maximum version number for every oid
and get the userid
and the number of the row.
When I tried this, unfortunately I've got the whole table back:
SELECT MAX(version), oid, userid, number_of_objects
FROM table
GROUP BY oid, userid, number_of_objects
LIMIT 10;
But the real result, what I'm looking for would be:
userid oid MAX(version) number_of_objects
1 ab 5 14
1 cd 4 12
2 ef 2 3
2 gh 3 21
Somehow distinct on doesn't work either, it says:
SELECT DISTINCT ON is not supported
Do you have any idea?
UPDATE: in the meantime I came up with this workaround, but I feel like this is not the smartest solution. It's also very slow. But it works at least. Just in case:
SELECT * FROM table,
(SELECT MAX(version) as maxversion, oid, userid
FROM table
GROUP BY oid, userid
) as maxtable
WHERE table.oid = maxtable.oid
AND table.userid = maxtable.userid
AND table.version = maxtable.version
LIMIT 100;
Do you have any better solution?
If redshift does have window functions, you might try this:
SELECT *
FROM (
select oid,
userid,
version,
max(version) over (partition by oid, userid) as max_version,
from the_table
) t
where version = max_version;
I would expect that to be faster than a self join with a group by
.
Another option would be to use the row_number()
function:
SELECT *
FROM (
select oid,
userid,
version,
row_number() over (partition by oid, userid order by version desc) as rn,
from the_table
) t
where rn = 1;
It's more a matter of personal taste which one to use. Performance wise I wouldn't expect a difference.
select distinct
first_value(userid) over(
partition by oid
order by version desc
rows between unbounded preceding and unbounded following
) as userid
, oid
, first_value(version) over(
partition by oid
order by version desc
rows between unbounded preceding and unbounded following
) as max_version
, first_value(number_of_objects) over(
partition by oid
order by version desc
rows between unbounded preceding and unbounded following
) as number_of_objects
from table
order by oid;
AWS Redshift Documentation first_value
Don't forget nulls last
in the ordering if version
is nullable.
Long story short: Horses for courses.
The author's approach should be faster on smaller tables and to extract sample data but the window methods will be more consistent in performance and faster on the whole table.
Below are a few explain results I made on table with 17 columns, 184 121 798 rows and 12 809 740 unique ids (an average of 14 versions per id, but can go up to 40).
Quick summary:
Tomi's approach: cost=5983958.76..67801689853856.94 (6*10^6 for first row, and 7*10^13 for whole table)
@a_horse_with_no_name approach: cost=1000027117538.39..1000031720583.59 (10^12 for any query)
@Merlin: almost exact to above method.
Original approach
explain
SELECT * FROM table t,
(SELECT MAX(version) as maxversion, id
FROM table
GROUP BY id
) as maxtable
WHERE t.id = maxtable.id
AND t.version = maxtable.maxversion;
XN Hash Join DS_DIST_NONE (cost=5983958.76..67801689853856.94 rows=63811541 width=590)
Hash Cond: ((("outer".id)::text = ("inner".id)::text) AND ("outer".version = "inner".maxversion))
-> XN Seq Scan on equipment_visits ev (cost=0.00..1841218.08 rows=184121808 width=418)
-> XN Hash (cost=5063349.72..5063349.72 rows=184121808 width=172)
-> XN Subquery Scan maxtable (cost=2761827.12..5063349.72 rows=184121808 width=172)
-> XN HashAggregate (cost=2761827.12..3222131.64 rows=184121808 width=44)
-> XN Seq Scan on equipment_visits (cost=0.00..1841218.08 rows=184121808 width=44)
So, cost of the first and all rows are 5983958.76 (6*10^6) and 67801689853856.94 (7*10^13).
a_horse_with_no_name's approach
Both solutions provided by the @a_horse_with_no_name have almost exact plan so I'll paste only one of them
explain
SELECT *
FROM (
select *,
row_number() over (partition by id order by version desc) as rn
from table
)
where rn = 1;
gives
Filter: (rn = 1)
-> XN Window (cost=1000027117538.39..1000029419060.99 rows=184121808 width=44)
Partition: id
Order: version
-> XN Sort (cost=1000027117538.39..1000027577842.91 rows=184121808 width=44)
Sort Key: id, version
-> XN Seq Scan on table (cost=0.00..1841218.08 rows=184121808 width=44)
Merlin's approach
Solution provided by @Merlin seems to be incomplete, as it doesn't return all values for the latest version, but it has similar performance to second option
explain
select distinct
id
, first_value(version) over(
partition by id
order by version desc
rows between unbounded preceding and unbounded following
) as max_version
, first_value(additional_col) over(
partition by id
order by version desc
rows between unbounded preceding and unbounded following
) as additional_col
from table t;
gives
XN Unique (cost=1000027117538.39..1000032180888.11 rows=184121808 width=84)
-> XN Window (cost=1000027117538.39..1000030799974.55 rows=184121808 width=84)
Partition: id
Order: version
-> XN Sort (cost=1000027117538.39..1000027577842.91 rows=184121808 width=84)
Sort Key: id, version
-> XN Seq Scan on table (cost=0.00..1841218.08 rows=184121808 width=84)