I have a PostgreSQL database that has multiple entries for the objectid
, on multiple devicenames
, but there is a unique timestamp
for each entry. The table looks something like this:
address | devicename | objectid | timestamp
--------+------------+---------------+------------------------------
1.1.1.1 | device1 | vs_hub.ch1_25 | 2012-10-02 17:36:41.011629+00
1.1.1.2 | device2 | vs_hub.ch1_25 | 2012-10-02 17:48:01.755559+00
1.1.1.1 | device1 | vs_hub.ch1_25 | 2012-10-03 15:37:09.06065+00
1.1.1.2 | device2 | vs_hub.ch1_25 | 2012-10-03 15:48:33.93128+00
1.1.1.1 | device1 | vs_hub.ch1_25 | 2012-10-05 16:01:59.266779+00
1.1.1.2 | device2 | vs_hub.ch1_25 | 2012-10-05 16:13:46.843113+00
1.1.1.1 | device1 | vs_hub.ch1_25 | 2012-10-06 01:11:45.853361+00
1.1.1.2 | device2 | vs_hub.ch1_25 | 2012-10-06 01:23:21.204324+00
I want to delete all but the oldest entry for each odjectid
and devicename
. In this case I want to delete all but:
1.1.1.1 | device1 | vs_hub.ch1_25 | 2012-10-02 17:36:41.011629+00
1.1.1.2 | device2 | vs_hub.ch1_25 | 2012-10-02 17:48:01.755559+00
Is there a way do this? Or is it possible to select the oldest entries for both "objectid
and devicename
" into a temp table?
To distill the described result, this would probably simplest and fastest:
SELECT DISTINCT ON (devicename, objectid) *
FROM tbl
ORDER BY devicename, objectid, ts DESC;
Details and explanation in this related answer.
From your sample data, I conclude that you are going to delete large portions of the original table. It is probably faster to just TRUNCATE
the table (or DROP
& recreate, since you should add a surrogate pk column anyway) and write the remaining rows to it. This would also provide you with a prestine table, implicitly clustered (ordered) the way it's best for your queries and save the work that VACUUM would have to do otherwise. And it's probably still faster overall:
I would also strongly advise to add a surrogate primary key to your table, preferably a serial
column.
BEGIN;
CREATE TEMP TABLE tmp_tbl ON COMMIT DROP AS
SELECT DISTINCT ON (devicename, objectid) *
FROM tbl
ORDER BY devicename, objectid, ts DESC;
TRUNCATE tbl;
ALTER TABLE tbl ADD column tbl_id serial PRIMARY KEY;
-- or, if you can afford to drop & recreate:
-- DROP TABLE tbl;
-- CREATE TABLE tbl (
-- tbl_id serial PRIMARY KEY
-- , address text
-- , devicename text
-- , objectid text
-- , ts timestamp);
INSERT INTO tbl (address, devicename, objectid, ts)
SELECT address, devicename, objectid, ts
FROM tmp_tbl;
COMMIT;
Do it all within a transaction to make sure you are not going to fail half way through.
This is fast as long as your setting for temp_buffers
is big enough to hold the temporary table. Else the system will start to swap data to disk and performance takes a dive. You can set temp_buffers
just for the current session like this:
SET temp_buffers = 1000MB;
So you don't waste RAM that you don't normally need for temp_buffers
. Has to be before the first use of temporary objects in the session. More information in this related answer.
Also, as the INSERT
follows a TRUNCATE
inside a transaction, it will be easy on the Write Ahead Log - improving performance.
Consider CREATE TABLE AS
for the alternative route:
- What causes large INSERT to slow down and disk usage to explode?
The only downside: You need an exclusive lock on the table. This may be a problem in databases with heavy concurrent load.
Finally, never use timestamp
as column name. It's a reserved word in every SQL standard and a type name in PostgreSQL. I renamed the column to ts
as you may have noticed.
This should do it:
delete from devices
using (
select ctid as cid,
row_number() over (partition by devicename, objectid order by timestamp asc) as rn
from devices
) newest
where newest.cid = devices.ctid
and newest.rn <> 1;
It creates a derived table that will assign unique numbers to each combination of (address, devicename, objectid) giving the earliest one (the one with the smallest timestamp
value) the number 1. Then this result is used to delete all those that do not have the number 1. The virtual column ctid
is used to uniquely identify those rows (it's an internal identifier supplied by Postgres).
Note that for deleting a really large amount of rows, Erwin's approach will most definitely be faster.
SQLFiddle demo: http://www.sqlfiddle.com/#!1/5d9fe/2
DELETE FROM DEVICES D WHERE d.timestamp = (SELECT min(timestamp) FROM DEVICES WHERE objectid = d.objectid and device = d.device)
My suggestion is to use a subquery, that checks existance of record with older timestamp:
DELETE FROM tablename
WHERE EXISTS(
SELECT * FROM tablename a
WHERE tablenmae.address = a.address
AND tablename.devicename = a.devicename
AND tablename.objectid = a.objectid
AND a.timestamp < tablename.timestamp
)
Query for selecting oldest records will be look like this:
SELECT address, devicename, objectid, MIN(timestamp)
FROM tablename
GROUP BY address, devicename, objectid
This should work assuming that address, devicename and objectid
make up a unique identifier
DELETE FROM tablename
WHERE
address || devicename || objectid || timestamp NOT IN
(SELECT
address || devicename || objectid || min(timestamp)
FROM tablename
GROUP BY address, devicename, objectid)
This uses a concatenated string that consists of the unique columns to tie the selects together. One finds the min date for that unique combination, the next deletes those records from the table. Probably not the most efficient, but it should work.