How to delete duplicate entries?

I have to add a unique constraint to an existing table. This is fine except that the table has millions of rows already, and many of the rows violate the unique constraint I need to add.

What is the fastest approach to removing the offending rows? I have an SQL statement which finds the duplicates and deletes them, but it is taking forever to run. Is there another way to solve this problem? Maybe backing up the table, then restoring after the constraint is added?

标签： sql postgresql duplicate-removal unique-constraint sql-delete

16条回答

梦醉为红颜

2楼-- · 2019-01-02 19:48

If you have only one or a few duplicated entries, and they are indeed duplicated (that is, they appear twice), you can use the "hidden" ctid column, as proposed above, together with LIMIT:

DELETE FROM mytable WHERE ctid=(SELECT ctid FROM mytable WHERE […] LIMIT 1);

This will delete only the first of the selected rows.

0人赞添加讨论(0) 举报

何处买醉

3楼-- · 2019-01-02 19:48

I'm working with PostgreSQL 8.4. When I ran the proposed code, I found that it was not actually removing the duplicates. In running some tests, I found that adding the "DISTINCT ON (duplicate_column_name)" and the "ORDER BY duplicate_column_name" did the trick. I'm no SQL guru, I found this in the PostgreSQL 8.4 SELECT...DISTINCT doc.

CREATE OR REPLACE FUNCTION remove_duplicates(text, text) RETURNS void AS $$
DECLARE
  tablename ALIAS FOR $1;
  duplicate_column ALIAS FOR $2;
BEGIN
  EXECUTE 'CREATE TEMPORARY TABLE _DISTINCT_' || tablename || ' AS (SELECT DISTINCT ON (' || duplicate_column || ') * FROM ' || tablename || ' ORDER BY ' || duplicate_column || ' ASC);';
  EXECUTE 'DELETE FROM ' || tablename || ';';
  EXECUTE 'INSERT INTO ' || tablename || ' (SELECT * FROM _DISTINCT_' || tablename || ');';
  EXECUTE 'DROP TABLE _DISTINCT_' || tablename || ';';
  RETURN;
END;
$$ LANGUAGE plpgsql;

0人赞添加讨论(0) 举报

呛了眼睛熬了心

4楼-- · 2019-01-02 19:50

You can use oid or ctid, which is normally a "non-visible" columns in the table:

DELETE FROM table
 WHERE ctid NOT IN
  (SELECT MAX(s.ctid)
    FROM table s
    GROUP BY s.column_has_be_distinct);

0人赞添加讨论(0) 举报

ら面具成の殇う

5楼-- · 2019-01-02 19:52

For example you could:

CREATE TABLE tmp ...
INSERT INTO tmp SELECT DISTINCT * FROM t;
DROP TABLE t;
ALTER TABLE tmp RENAME TO t;

0人赞添加讨论(0) 举报

梦该遗忘

6楼-- · 2019-01-02 19:53

From an old postgresql.org mailing list:

create table test ( a text, b text );

Unique values

insert into test values ( 'x', 'y');
insert into test values ( 'x', 'x');
insert into test values ( 'y', 'y' );
insert into test values ( 'y', 'x' );

Duplicate values

insert into test values ( 'x', 'y');
insert into test values ( 'x', 'x');
insert into test values ( 'y', 'y' );
insert into test values ( 'y', 'x' );

One more double duplicate

insert into test values ( 'x', 'y');

select oid, a, b from test;

Select duplicate rows

select o.oid, o.a, o.b from test o
    where exists ( select 'x'
                   from test i
                   where     i.a = o.a
                         and i.b = o.b
                         and i.oid < o.oid
                 );

Delete duplicate rows

Note: PostgreSQL dosn't support aliases on the table mentioned in the from clause of a delete.

delete from test
    where exists ( select 'x'
                   from test i
                   where     i.a = test.a
                         and i.b = test.b
                         and i.oid < test.oid
             );

0人赞添加讨论(0) 举报

萌妹纸的霸气范

7楼-- · 2019-01-02 19:55

Some of these approaches seem a little complicated, and I generally do this as:

Given table table, want to unique it on (field1, field2) keeping the row with the max field3:

DELETE FROM table USING table alias 
  WHERE table.field1 = alias.field1 AND table.field2 = alias.field2 AND
    table.max_field < alias.max_field

For example, I have a table, user_accounts, and I want to add a unique constraint on email, but I have some duplicates. Say also that I want to keep the most recently created one (max id among duplicates).

DELETE FROM user_accounts USING user_accounts ua2
  WHERE user_accounts.email = ua2.email AND user_account.id < ua2.id;

Note - USING is not standard SQL, it is a PostgreSQL extension (but a very useful one), but the original question specifically mentions PostgreSQL.

0人赞添加讨论(0) 举报

1 2 3 下一页