What is the best way to remove duplicate rows from a fairly large SQL Server
table (e.g. 300,000+ rows)?
The rows, of course, will not be perfect duplicates because of the existence of the RowID
identity field.
CREATE TABLE MyTable
(
    RowID int not null identity(1,1) primary key,
    Col1 varchar(20) not null,
    Col2 varchar(2048) not null,
    Col3 tinyint not null
)
The following query is useful for deleting duplicate rows. The table in this example has ID as an identity column, and the columns with duplicate data are Column1, Column2, and Column3. The script shows the use of GROUP BY, HAVING, and ORDER BY in one query, and returns each duplicated column combination along with its count.

Quick and dirty way to delete exact duplicate rows (for small tables):
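The scripts described above are not shown here; a sketch of both, assuming the Column1/Column2/Column3 names from the description (the #Holding temp-table name is illustrative):

```sql
-- Report each duplicated column combination with its count,
-- using GROUP BY, HAVING, and ORDER BY in one query
SELECT Column1, Column2, Column3, COUNT(*) AS DupCount
FROM MyTable
GROUP BY Column1, Column2, Column3
HAVING COUNT(*) > 1
ORDER BY DupCount DESC;

-- Quick and dirty (small tables): copy the distinct rows out,
-- empty the table, then copy them back in. Note that the
-- identity values are regenerated by this approach.
SELECT DISTINCT Column1, Column2, Column3
INTO #Holding
FROM MyTable;

TRUNCATE TABLE MyTable;

INSERT INTO MyTable (Column1, Column2, Column3)
SELECT Column1, Column2, Column3
FROM #Holding;

DROP TABLE #Holding;
```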
Here is another good article on removing duplicates.
It discusses why it's hard: "SQL is based on relational algebra, and duplicates cannot occur in relational algebra, because duplicates are not allowed in a set."
It covers the temp-table solution and gives two MySQL examples.
Going forward, are you going to prevent duplicates at the database level, or from the application side? I would suggest the database level, because your database should be responsible for maintaining referential integrity; developers will just cause problems. ;)
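One way to enforce this at the database level, once the existing duplicates are gone, is a unique constraint over the columns that should not repeat (a sketch; the constraint name is illustrative, and note that SQL Server limits index key size, which matters for wide columns like a varchar(2048)):

```sql
-- Reject any future row whose (Col1, Col2, Col3) combination
-- already exists in the table
ALTER TABLE MyTable
ADD CONSTRAINT UQ_MyTable_Col1_Col2_Col3 UNIQUE (Col1, Col2, Col3);
```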
Postgres:
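The Postgres snippet is not preserved here; a common sketch of this technique, applied to the MyTable schema from the question, uses the system column ctid in place of the identity column:

```sql
-- Postgres: ctid identifies a physical row version, so it can
-- stand in for RowID. For each group of identical rows, this
-- keeps the row with the lowest ctid and deletes the rest.
DELETE FROM MyTable a
USING MyTable b
WHERE a.ctid > b.ctid
  AND a.Col1 = b.Col1
  AND a.Col2 = b.Col2
  AND a.Col3 = b.Col3;
```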
Oh sure. Use a temp table. If you want a single, not-very-performant statement that "works" you can go with:
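The statement itself is missing here; one way to write it, matching the description that follows (a sketch against the MyTable schema from the question):

```sql
-- Keep one RowID per group of rows whose Col1/Col2/Col3 values
-- are identical, and delete every other row in each group
DELETE FROM MyTable
WHERE RowID NOT IN (
    SELECT MIN(RowID)
    FROM MyTable
    GROUP BY Col1, Col2, Col3
);
```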
Basically, for each row in the table, the sub-select finds the top RowID of all rows that are exactly like the row under consideration. So you end up with a list of RowIDs that represent the "original" non-duplicated rows.
This is the easiest way to delete duplicate records:
http://askme.indianyouth.info/details/how-to-dumplicate-record-from-table-in-using-sql-105
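The linked page's code is not reproduced here; for reference, an approach commonly described as the simplest in SQL Server is a ROW_NUMBER() CTE (a sketch against the MyTable schema from the question):

```sql
-- Number the rows within each duplicate group, then delete
-- everything past the first row of each group. Deleting
-- through the CTE removes the underlying rows in MyTable.
WITH Numbered AS (
    SELECT RowID,
           ROW_NUMBER() OVER (
               PARTITION BY Col1, Col2, Col3
               ORDER BY RowID
           ) AS rn
    FROM MyTable
)
DELETE FROM Numbered
WHERE rn > 1;
```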