How to delete duplicate rows without unique identi

2019-01-08 11:41发布

I have duplicate rows in my table and I want to delete duplicates in the most efficient way since the table is big. After some research, I have come up with this query:

WITH TempEmp AS
(
SELECT name, ROW_NUMBER() OVER(PARTITION by name, address, zipcode ORDER BY name) AS duplicateRecCount
FROM mytable
)
-- Now Delete Duplicate Records
DELETE FROM TempEmp
WHERE duplicateRecCount > 1;

But it only works in SQL, not in Netezza. It would seem that it does not like the DELETE after the WITH clause?

9条回答
Emotional °昔
2楼-- · 2019-01-08 12:12

We can use a window function for very effective removal of duplicate rows:

DELETE FROM tab 
  WHERE id IN (SELECT id 
                  FROM (SELECT row_number() OVER (PARTITION BY column_with_duplicate_values), id 
                           FROM tab) x 
                 WHERE x.row_number > 1);

Some PostgreSQL's optimized version (with ctid):

DELETE FROM tab 
  WHERE ctid = ANY(ARRAY(SELECT ctid 
                  FROM (SELECT row_number() OVER (PARTITION BY column_with_duplicate_values), ctid 
                           FROM tab) x 
                 WHERE x.row_number > 1));
查看更多
对你真心纯属浪费
3楼-- · 2019-01-08 12:20

If you have no other unique identifier, you can use ctid:

delete from mytable
    where exists (select 1
                  from mytable t2
                  where t2.name = mytable.name and
                        t2.address = mytable.address and
                        t2.zip = mytable.zip and
                        t2.ctid > mytable.ctid
                 );

It is a good idea to have a unique, auto-incrementing id in every table. Doing a delete like this is one important reason why.

查看更多
爱情/是我丢掉的垃圾
4楼-- · 2019-01-08 12:20

From the documentation delete duplicate rows

A frequent question in IRC is how to delete rows that are duplicates over a set of columns, keeping only the one with the lowest ID. This query does that for all rows of tablename having the same column1, column2, and column3.

DELETE FROM tablename
WHERE id IN (SELECT id
          FROM (SELECT id,
                         ROW_NUMBER() OVER (partition BY column1, column2, column3 ORDER BY id) AS rnum
                 FROM tablename) t
          WHERE t.rnum > 1);

Sometimes a timestamp field is used instead of an ID field.

查看更多
登录 后发表回答