可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效,请关闭广告屏蔽插件后再试):
问题:
I have duplicate rows in my table and I want to delete duplicates in the most efficient way since the table is big. After some research, I have come up with this query:
WITH TempEmp AS
(
SELECT name, ROW_NUMBER() OVER(PARTITION by name, address, zipcode ORDER BY name) AS duplicateRecCount
FROM mytable
)
-- Now Delete Duplicate Records
DELETE FROM TempEmp
WHERE duplicateRecCount > 1;
But it only works in SQL, not in Netezza. It would seem that it does not like the DELETE
after the WITH
clause?
回答1:
I like @erwin-brandstetter 's solution, but wanted to show a solution with the USING
keyword:
DELETE FROM table_with_dups T1
USING table_with_dups T2
WHERE T1.ctid < T2.ctid -- delete the "older" ones
AND T1.name = T2.name -- list columns that define duplicates
AND T1.address = T2.address
AND T1.zipcode = T2.zipcode;
If you want to review the records before deleting them, then simply replace DELETE
with SELECT *
and USING
with a comma ,
, i.e.
SELECT * FROM table_with_dups T1
, table_with_dups T2
WHERE T1.ctid < T2.ctid -- select the "older" ones
AND T1.name = T2.name -- list columns that define duplicates
AND T1.address = T2.address
AND T1.zipcode = T2.zipcode;
Update: I tested some of the different solutions here for speed. If you don't expect many duplicates, then this solution performs much better than the ones that have a NOT IN (...)
clause as those generate a lot of rows in the subquery.
If you rewrite the query to use IN (...)
then it performs similarly to the solution presented here, but the SQL code becomes much less concise.
Update 2: If you have NULL
values in one of the key columns (which you really shouldn't IMO), then you can use COALESCE()
in the condition for that column, e.g.
AND COALESCE(T1.col_with_nulls, '[NULL]') = COALESCE(T2.col_with_nulls, '[NULL]')
回答2:
If you have no other unique identifier, you can use ctid
:
delete from mytable
where exists (select 1
from mytable t2
where t2.name = mytable.name and
t2.address = mytable.address and
t2.zip = mytable.zip and
t2.ctid > mytable.ctid
);
It is a good idea to have a unique, auto-incrementing id in every table. Doing a delete
like this is one important reason why.
回答3:
In a perfect world, every table has a unique identifier of some sort.
In the absence of any unique column (or combination thereof), use the ctid
column:
DELETE FROM tbl
WHERE ctid NOT IN (
SELECT min(ctid) -- ctid is NOT NULL by definition
FROM tbl
GROUP BY name, address, zipcode); -- list columns defining duplicates
The above query is short, conveniently listing column names only once. NOT IN (SELECT ...)
is a tricky query style when NULL values can be involved, but the system column ctid
is never NULL. See:
- Find records where join doesn't exist
Using EXISTS
as demonstrated by @Gordon is typically faster. So is a self-join with the USING
clause like @isapir added later. Both should result in the same query plan.
But note an important difference: These other queries treat NULL
values as not equal, while GROUP BY
(or DISTINCT
or DISTINCT ON ()
) treats NULL values as equal. Does not matter if key columns are defined NOT NULL
. Else, depending on your definition of "duplicate", you'll need one or the other approach. Or use IS NOT DISTINCT FROM
in comparison of values (which may not be able to use some indexes).
Disclaimer:
ctid
is an internal implementation detail of Postgres, it's not in the SQL standard and can be changed between major versions without warning (even if that's very unlikely). Its values can change between commands due to background processes or concurrent write operations (but not within the same command).
Related:
Aside:
The target of a DELETE
statement cannot be the CTE, only the underlying table. That's a spillover from SQL Server - as is your whole approach.
回答4:
Here is what I came up with, using a group by
DELETE FROM mytable
WHERE id NOT in (
SELECT MIN(id)
FROM mytable
GROUP BY name, address, zipcode
)
It deletes the duplicates, preserving the oldest record that has duplicates.
回答5:
We can use a window function for very effective removal of duplicate rows:
DELETE FROM tab
WHERE id IN (SELECT id
FROM (SELECT row_number() OVER (PARTITION BY column_with_duplicate_values), id
FROM tab) x
WHERE x.row_number > 1);
Some PostgreSQL's optimized version (with ctid):
DELETE FROM tab
WHERE ctid = ANY(ARRAY(SELECT ctid
FROM (SELECT row_number() OVER (PARTITION BY column_with_duplicate_values), ctid
FROM tab) x
WHERE x.row_number > 1));
回答6:
The valid syntax is specified at http://www.postgresql.org/docs/current/static/sql-delete.html
I would ALTER your table to add a unique auto-incrementing primary key id so that you can run a query like the following which will keep the first of each set of duplicates (ie the one with the lowest id). Note that adding the key is a bit more complicated in Postgres than some other DBs.
DELETE FROM mytable d USING (
SELECT min(id), name, address, zip
FROM mytable
GROUP BY name, address, zip HAVING COUNT() > 1
) AS k
WHERE d.id <> k.id
AND d.name=k.name
AND d.address=k.address
AND d.zip=k.zip;
回答7:
If you want to keep one row out of duplicate rows in the table.
create table some_name_for_new_table as
(select * from (select *,row_number() over (partition by pk_id) row_n from
your_table_name_where_duplicates_are_present) a where row_n = 1);
This will create a table which you can copy.
Before copying table please delete the column 'row_n'
回答8:
From the documentation delete duplicate rows
A frequent question in IRC is how to delete rows that are duplicates over a set of columns, keeping only the one with the lowest ID.
This query does that for all rows of tablename having the same column1, column2, and column3.
DELETE FROM tablename
WHERE id IN (SELECT id
FROM (SELECT id,
ROW_NUMBER() OVER (partition BY column1, column2, column3 ORDER BY id) AS rnum
FROM tablename) t
WHERE t.rnum > 1);
Sometimes a timestamp field is used instead of an ID field.
回答9:
If you want a unique identifier for every row, you could just add one (a serial, or a guid), and treat it like a surrogate key.
CREATE TABLE thenames
( name text not null
, address text not null
, zipcode text not null
);
INSERT INTO thenames(name,address,zipcode) VALUES
('James', 'main street', '123' )
,('James', 'main street', '123' )
,('James', 'void street', '456')
,('Alice', 'union square' , '123')
;
SELECT*FROM thenames;
-- add a surrogate key
ALTER TABLE thenames
ADD COLUMN seq serial NOT NULL PRIMARY KEY
;
SELECT*FROM thenames;
DELETE FROM thenames del
WHERE EXISTS(
SELECT*FROM thenames x
WHERE x.name=del.name
AND x.address=del.address
AND x.zipcode=del.zipcode
AND x.seq < del.seq
);
-- add the unique constrain,so that new dupplicates cannot be created in the future
ALTER TABLE thenames
ADD UNIQUE (name,address,zipcode)
;
SELECT*FROM thenames;