I have a table core_message
in Postgres, with millions of rows that looks like this (simplified):
┌────────────────┬──────────────────────────┬─────────────────┬───────────┬──────────────────────────────────────────┐
│ Colonne │ Type │ Collationnement │ NULL-able │ Par défaut │
├────────────────┼──────────────────────────┼─────────────────┼───────────┼──────────────────────────────────────────┤
│ id │ integer │ │ not null │ nextval('core_message_id_seq'::regclass) │
│ mmsi │ integer │ │ not null │ │
│ time │ timestamp with time zone │ │ not null │ │
│ point │ geography(Point,4326) │ │ │ │
└────────────────┴──────────────────────────┴─────────────────┴───────────┴──────────────────────────────────────────┘
Index:
"core_message_pkey" PRIMARY KEY, btree (id)
"core_message_uniq_mmsi_time" UNIQUE CONSTRAINT, btree (mmsi, "time")
"core_messag_mmsi_b36d69_idx" btree (mmsi, "time" DESC)
"core_message_point_id" gist (point)
The mmsi
column is a unique identifier used to identify ships in the world. I'm trying to get the latest row for each mmsi
.
I can get that like this, for example:
SELECT a.* FROM core_message a
JOIN (SELECT mmsi, max(time) AS time FROM core_message GROUP BY mmsi) b
ON a.mmsi=b.mmsi and a.time=b.time;
But this is too slow, 2 seconds+.
So my solution was to create a distinct table containing only the latest rows (100K+ rows max) of the core_message
table, called LatestMessage
.
This table is populated via my application every time new rows have to be added to core_message
.
It worked fine, I'm able to access the table in a matter of milliseconds. But I'd be curious to know if there is a better way to achieve that using only one table and keep the same level of performance for data access.
This answer seems to go in the way of the
DISTINCT ON
answer here, however it also mentions this :Using this other great answer, I find a way to keep the same performance as a distinct table with the use of
LATERAL
. By using a new tabletest_boats
I can do something like this :This table creation take 40+ seconds which is pretty similar to the time taken by the other answer here.
Then, with the help of
LATERAL
:This is blazingly fast, 1+ millisecond.
This will need the modification of my program's logic and the use of a query a bit more complex but I think I can live with that.
For a fast solution without the need to create a new table, check out the answer of @ErwinBrandstetter below
UPDATE: I feel this question is not quite answered yet, as it's not very clear why the other solutions proposed perform poorly here.
I tried the benchmark mentionned here. At first, it would seem that the
DISTINCT ON
way is fast enough if you do a request like the one proposed in the benchmark : +/- 30ms on my computer. But this is because that request uses index only scan. If you include a field that is not in the index,some_column
in the case of the benchmark, the performance will drop to +/- 100ms.Not a dramatic drop in performance yet. That is why we need a benchmark with a bigger data set. Something similar to my case : 40K customers and 8M rows. Here
Let's try again the
DISTINCT ON
with this new table:This takes about 1.5 seconds to complete.
This takes about 35 seconds to complete.
Now, to come back to my first solution above. It is using an index only scan and a
LIMIT
, that's one of the reason why it is extremely fast. If I recraft that query to not use index-only scan and dump the limit :This will take about 500ms, which is still pretty fast.
For a more in-depth benchmark of sort, see my other answer below.
You have put existing answers to good use and came up with great solutions in your own answer. Some missing pieces:
You used this query to create the
test_boats
table with uniquemmsi
:For many rows per boat (
mmsi
), use this fasterRECURSIVE
solution instead:This hardly gets any slower with more rows per boat, as opposed to
DISTINCT ON
which is typically faster with only few rows per boat. Each only needs an index withmmsi
as leading column to be fast.If possible, create that
boats
table and add a FK constraint to it. (Means you have to maintain it.) Then you can go on using the optimalLATERAL
query you have in your answer and never miss any boats. (Orphaned boats may be worth tracking / removing in the long run.)Else, another iteration of that
RECURSIVE
query is the next best thing to get whole rows for the latest position of each boat quickly:You have both of these indexes:
A
UNIQUE
constraint is implemented with all columns in defaultASC
sort order. That cannot be changed. If you don't actually need the constraint, you might replace it with aUNIQUE
index, mostly achieving the same. But there you can add any sort order you like. Related:But there is no need for the use case at hand. Postgres can scan a b-tree index backwards at practically the same speed. And I see nothing here that would require inverted sort order for the two columns. The additional index
core_messag_mmsi_b36d69_idx
is expensive dead freight - unless you have other use cases that actually need it. See:To best use the index
core_message_uniq_mmsi_time
from theUNIQUE
constraint I step through both columns in descending order. That matters.In Postgres, I recommend
distinct on
:For best performance, you want an index on
(mmsi, time desc)
.Another approach using
ROW_NUMBER()
, which is widely supported across RDBMS:This query should benefit of existing index
core_messag_mmsi_b36d69_idx
.Here is a quick performance comparison for the queries mention in this post.
Current setup :
The table
core_message
has 10,904,283 rows and there is 60,740 rows intest_boats
(or 60,740 distinct mmsi incore_message
).And I'm using PostgreSQL 11.5
Query using index-only scan :
1) using
DISTINCT ON
:2) using
RECURSIVE
withLATERAL
:3) Using an extra table with
LATERAL
:Query not using index-only scan :
4) using
DISTINCT ON
withmmsi,time DESC
INDEX
:5) using
DISTINCT ON
with backwardmmsi,time
UNIQUE CONSTRAINT
:6) using
RECURSIVE
withLATERAL
andmmsi,time DESC
INDEX
:7) using
RECURSIVE
withLATERAL
and backwardmmsi,time
UNIQUE CONSTRAINT
:8) Using an extra table with
LATERAL
:Using a dedicated table for the last message:
9) Here is my initial solution, using a distinct table with only the last message. This table is populated as new messages arrive but could also be created like so :
Then the request to get the latest message is as simple as that :
Results :
Average of multiple query (around 5 for the fast one):
1) 9146 ms
2) 728 ms
3) 498 ms
4) 51488 ms
5) 54764 ms
6) 729 ms
7) 778 ms
8) 516 ms
9) 15 ms
Conclusion:
I won't comment on the dedicated table solution, and will keep that for the end.
The additional table (
test_boats
) solution is definitely the winner here but theRECURSIVE
solution is also pretty efficient.There is a huge gap in performance for the
DISTINCT ON
using index-only scan and the one not using it but, the performance gain is rather small for the other efficient query.This makes sense as the major improvement those queries bring is the fact that they don't need to loop over the whole
core_message
table but only on a subset of the uniquemmsi
that is significantly smaller (60K+) in comparison to thecore_message
table size (10M+)As an additional note, there does not seem to be significant improvement in performance for the queries using the
UNIQUE CONSTRAINT
if I drop themmsi,time DESC
INDEX
. But dropping that index will of course save me some space (this index currently take 328MB)About the dedicated table solution:
Each messages stored in the
core_message
table carries both positional informations (position, speed, heading,etc.) AND ship informations (name, callsign, dimensions, etc.), as well as ship identifier (mmsi).To give a bit more background on what I'm actually trying to do : I'm implementing a backend to store messages emitted by ships via the AIS protocol.
As such, every unique mmsi I got, I got it via this protocol. It is not a pre-defined list. It keeps adding new MMSI until I got every ships in the world using AIS.
In that context, a dedicated table with ship informations as last message received makes sense.
I could avoid using such a table as we've seen with the
RECURSIVE
solution, but... a dedicated table is still 50x faster than thisRECURSIVE
solution.That dedicated table is in fact similar to the
test_boat
table, with more information than just themmsi
field. As it is, having a table withmmsi
only field or a table with every last informations of thecore_message
table add the same complexity to my application.In the end, I think I will go for this dedicated table. It will gives me unbeatable speed and I'll still have the possibility to use the
LATERAL
trick oncore_message
, which will gives me more flexibility.