What is the best way to remove duplicate rows from a fairly large SQL Server
table (e.g. 300,000+ rows)?
The rows, of course, will not be perfect duplicates because of the existence of the RowID
identity field.
MyTable
RowID int not null identity(1,1) primary key,
Col1 varchar(20) not null,
Col2 varchar(2048) not null,
Col3 tinyint not null
This approach is useful if you want to preview the rows you are about to remove and keep control over which of the duplicate rows to keep. See http://developer.azurewebsites.net/2014/09/better-sql-group-by-find-duplicate-data/
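A minimal preview query along those lines, assuming a duplicate means identical Col1, Col2, and Col3 (the alias names are illustrative, not from the linked article):

    -- List every row that belongs to a duplicate group, so the
    -- duplicates can be reviewed before anything is deleted.
    SELECT t.RowID, t.Col1, t.Col2, t.Col3
    FROM MyTable AS t
    INNER JOIN (
        SELECT Col1, Col2, Col3
        FROM MyTable
        GROUP BY Col1, Col2, Col3
        HAVING COUNT(*) > 1
    ) AS dupes
        ON  t.Col1 = dupes.Col1
        AND t.Col2 = dupes.Col2
        AND t.Col3 = dupes.Col3
    ORDER BY t.Col1, t.Col2, t.Col3, t.RowID;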
Assuming no NULLs, you GROUP BY the unique columns and SELECT the MIN (or MAX) RowID as the row to keep. Then just delete everything that doesn't have one of the kept RowIDs; a sketch of that DELETE is below. If the key is a GUID instead of an integer, MIN and MAX cannot be applied to a uniqueidentifier column directly, so convert it first, e.g. replace MIN(RowID) with CONVERT(uniqueidentifier, MIN(CONVERT(char(36), RowID))).
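A minimal sketch of that DELETE against the example schema, assuming duplicates are defined by Col1, Col2, and Col3:

    -- Keep the lowest RowID in each (Col1, Col2, Col3) group, delete the rest.
    DELETE t
    FROM MyTable AS t
    LEFT OUTER JOIN (
        SELECT MIN(RowID) AS RowID
        FROM MyTable
        GROUP BY Col1, Col2, Col3
    ) AS KeepRows
        ON t.RowID = KeepRows.RowID
    WHERE KeepRows.RowID IS NULL;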
I don't know how well it would perform, but I think you could write a trigger to enforce this, even if you couldn't do it directly with a unique index (Col2 at varchar(2048) is wider than SQL Server's maximum index key size). Something like the sketch below:
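A rough sketch, assuming duplicates are defined by Col1, Col2, and Col3 (the trigger name and error message are made up for illustration):

    CREATE TRIGGER trg_MyTable_NoDuplicates
    ON MyTable
    AFTER INSERT, UPDATE
    AS
    BEGIN
        -- If an inserted/updated row matches an existing row on the
        -- duplicate-defining columns but has a different RowID, reject it.
        IF EXISTS (
            SELECT 1
            FROM MyTable AS t
            INNER JOIN inserted AS i
                ON  t.Col1 = i.Col1
                AND t.Col2 = i.Col2
                AND t.Col3 = i.Col3
                AND t.RowID <> i.RowID
        )
        BEGIN
            RAISERROR('Duplicate row rejected', 16, 1);
            ROLLBACK TRANSACTION;
        END
    END;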
Also, varchar(2048) sounds fishy to me (some things in life are 2048 bytes, but it's pretty uncommon); should it really not be varchar(max)?
Another option is to rebuild the data in a new table:
1. Create a new blank table with the same structure.
2. Execute a query that copies only one row from each duplicate group into it (see the sketch after this list).
3. Then execute a query that swaps the new table in for the original.
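One way those queries might look, using MyTable_dedup as an illustrative name for the working table:

    -- 1. New blank table with the same structure (SELECT ... INTO copies the
    --    columns and the identity property, but not the primary key).
    SELECT TOP 0 * INTO MyTable_dedup FROM MyTable;

    -- 2. Copy one row per (Col1, Col2, Col3) group, keeping the original RowID values.
    SET IDENTITY_INSERT MyTable_dedup ON;
    INSERT INTO MyTable_dedup (RowID, Col1, Col2, Col3)
    SELECT MIN(RowID), Col1, Col2, Col3
    FROM MyTable
    GROUP BY Col1, Col2, Col3;
    SET IDENTITY_INSERT MyTable_dedup OFF;

    -- 3. Swap the tables (re-create the primary key on the new table afterwards).
    DROP TABLE MyTable;
    EXEC sp_rename 'MyTable_dedup', 'MyTable';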
I would prefer a CTE for deleting duplicate rows from a SQL Server table.
I strongly recommend following this article: http://codaffection.com/sql-server-article/delete-duplicate-rows-in-sql-server/
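The usual CTE pattern for this, sketched against the example table (ROW_NUMBER numbers the rows within each duplicate group and every copy after the first is deleted; this is an illustration, not the article's exact code):

    WITH Dupes AS (
        SELECT RowID,
               ROW_NUMBER() OVER (
                   PARTITION BY Col1, Col2, Col3
                   ORDER BY RowID
               ) AS rn
        FROM MyTable
    )
    DELETE FROM Dupes
    WHERE rn > 1;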