PostgreSQL delete all but the oldest records

2019-06-26 12:33发布

问题:

I have a PostgreSQL database that has multiple entries for the objectid, on multiple devicenames, but there is a unique timestamp for each entry. The table looks something like this:

address | devicename | objectid      |  timestamp       
--------+------------+---------------+------------------------------
1.1.1.1 | device1    | vs_hub.ch1_25 | 2012-10-02 17:36:41.011629+00
1.1.1.2 | device2    | vs_hub.ch1_25 | 2012-10-02 17:48:01.755559+00
1.1.1.1 | device1    | vs_hub.ch1_25 | 2012-10-03 15:37:09.06065+00
1.1.1.2 | device2    | vs_hub.ch1_25 | 2012-10-03 15:48:33.93128+00
1.1.1.1 | device1    | vs_hub.ch1_25 | 2012-10-05 16:01:59.266779+00
1.1.1.2 | device2    | vs_hub.ch1_25 | 2012-10-05 16:13:46.843113+00
1.1.1.1 | device1    | vs_hub.ch1_25 | 2012-10-06 01:11:45.853361+00
1.1.1.2 | device2    | vs_hub.ch1_25 | 2012-10-06 01:23:21.204324+00

I want to delete all but the oldest entry for each odjectid and devicename. In this case I want to delete all but:

1.1.1.1 | device1 | vs_hub.ch1_25 | 2012-10-02 17:36:41.011629+00
1.1.1.2 | device2 | vs_hub.ch1_25 | 2012-10-02 17:48:01.755559+00

Is there a way do this? Or is it possible to select the oldest entries for both "objectid and devicename" into a temp table?

回答1:

To distill the described result, this would probably simplest and fastest:

SELECT DISTINCT ON (devicename, objectid) *
FROM   tbl
ORDER  BY devicename, objectid, ts DESC;

Details and explanation in this related answer.

From your sample data, I conclude that you are going to delete large portions of the original table. It is probably faster to just TRUNCATE the table (or DROP & recreate, since you should add a surrogate pk column anyway) and write the remaining rows to it. This would also provide you with a prestine table, implicitly clustered (ordered) the way it's best for your queries and save the work that VACUUM would have to do otherwise. And it's probably still faster overall:

I would also strongly advise to add a surrogate primary key to your table, preferably a serial column.

BEGIN;

CREATE TEMP TABLE tmp_tbl ON COMMIT DROP AS
SELECT DISTINCT ON (devicename, objectid) *
FROM   tbl
ORDER  BY devicename, objectid, ts DESC;

TRUNCATE tbl;
ALTER TABLE tbl ADD column tbl_id serial PRIMARY KEY;

-- or, if you can afford to drop & recreate:
-- DROP TABLE tbl;
-- CREATE TABLE tbl (
--   tbl_id serial PRIMARY KEY
-- , address text
-- , devicename text
-- , objectid text
-- , ts timestamp);

INSERT INTO tbl (address, devicename, objectid, ts)
SELECT address, devicename, objectid, ts
FROM   tmp_tbl;

COMMIT;

Do it all within a transaction to make sure you are not going to fail half way through.

This is fast as long as your setting for temp_buffers is big enough to hold the temporary table. Else the system will start to swap data to disk and performance takes a dive. You can set temp_buffers just for the current session like this:

SET temp_buffers = 1000MB;

So you don't waste RAM that you don't normally need for temp_buffers. Has to be before the first use of temporary objects in the session. More information in this related answer.

Also, as the INSERT follows a TRUNCATE inside a transaction, it will be easy on the Write Ahead Log - improving performance.

Consider CREATE TABLE AS for the alternative route:

  • What causes large INSERT to slow down and disk usage to explode?

The only downside: You need an exclusive lock on the table. This may be a problem in databases with heavy concurrent load.

Finally, never use timestamp as column name. It's a reserved word in every SQL standard and a type name in PostgreSQL. I renamed the column to ts as you may have noticed.



回答2:

This should do it:

delete from devices
using (
   select ctid as cid, 
          row_number() over (partition by devicename, objectid order by timestamp asc) as rn
   from devices
) newest
where newest.cid = devices.ctid
and newest.rn <> 1;

It creates a derived table that will assign unique numbers to each combination of (address, devicename, objectid) giving the earliest one (the one with the smallest timestamp value) the number 1. Then this result is used to delete all those that do not have the number 1. The virtual column ctid is used to uniquely identify those rows (it's an internal identifier supplied by Postgres).

Note that for deleting a really large amount of rows, Erwin's approach will most definitely be faster.

SQLFiddle demo: http://www.sqlfiddle.com/#!1/5d9fe/2



回答3:

DELETE FROM DEVICES D WHERE d.timestamp = (SELECT min(timestamp) FROM DEVICES WHERE objectid = d.objectid and device = d.device)



回答4:

My suggestion is to use a subquery, that checks existance of record with older timestamp:

DELETE FROM tablename
WHERE EXISTS(
  SELECT * FROM tablename a
  WHERE tablenmae.address = a.address
    AND tablename.devicename = a.devicename
    AND tablename.objectid = a.objectid
    AND a.timestamp < tablename.timestamp
)

Query for selecting oldest records will be look like this:

SELECT address, devicename, objectid, MIN(timestamp)
FROM tablename
GROUP BY address, devicename, objectid


回答5:

This should work assuming that address, devicename and objectid make up a unique identifier

DELETE FROM tablename 
WHERE 
  address || devicename || objectid || timestamp NOT IN 
  (SELECT 
     address || devicename || objectid || min(timestamp) 
   FROM tablename 
   GROUP BY address, devicename, objectid)

This uses a concatenated string that consists of the unique columns to tie the selects together. One finds the min date for that unique combination, the next deletes those records from the table. Probably not the most efficient, but it should work.