How can I remove duplicate rows?

2018-12-30 23:28发布

What is the best way to remove duplicate rows from a fairly large SQL Server table (i.e. 300,000+ rows)?

The rows, of course, will not be perfect duplicates because of the existence of the RowID identity field.

MyTable

RowID int not null identity(1,1) primary key,
Col1 varchar(20) not null,
Col2 varchar(2048) not null,
Col3 tinyint not null

30条回答
零度萤火
2楼-- · 2018-12-31 00:02
CREATE TABLE car(Id int identity(1,1), PersonId int, CarId int)

INSERT INTO car(PersonId,CarId)
VALUES(1,2),(1,3),(1,2),(2,4)

--SELECT * FROM car

;WITH CTE as(
SELECT ROW_NUMBER() over (PARTITION BY personid,carid order by personid,carid) as rn,Id,PersonID,CarId from car)

DELETE FROM car where Id in(SELECT Id FROM CTE WHERE rn>1)
查看更多
孤独寂梦人
3楼-- · 2018-12-31 00:02

I you want to preview the rows you are about to remove and keep control over which of the duplicate rows to keep. See http://developer.azurewebsites.net/2014/09/better-sql-group-by-find-duplicate-data/

with MYCTE as (
  SELECT ROW_NUMBER() OVER (
    PARTITION BY DuplicateKey1
                ,DuplicateKey2 -- optional
    ORDER BY CreatedAt -- the first row among duplicates will be kept, other rows will be removed
  ) RN
  FROM MyTable
)
DELETE FROM MYCTE
WHERE RN > 1
查看更多
大哥的爱人
4楼-- · 2018-12-31 00:04

Assuming no nulls, you GROUP BY the unique columns, and SELECT the MIN (or MAX) RowId as the row to keep. Then, just delete everything that didn't have a row id:

DELETE FROM MyTable
LEFT OUTER JOIN (
   SELECT MIN(RowId) as RowId, Col1, Col2, Col3 
   FROM MyTable 
   GROUP BY Col1, Col2, Col3
) as KeepRows ON
   MyTable.RowId = KeepRows.RowId
WHERE
   KeepRows.RowId IS NULL

In case you have a GUID instead of an integer, you can replace

MIN(RowId)

with

CONVERT(uniqueidentifier, MIN(CONVERT(char(36), MyGuidColumn)))
查看更多
泪湿衣
5楼-- · 2018-12-31 00:09

From the application level (unfortunately). I agree that the proper way to prevent duplication is at the database level through the use of a unique index, but in SQL Server 2005, an index is allowed to be only 900 bytes, and my varchar(2048) field blows that away.

I dunno how well it would perform, but I think you could write a trigger to enforce this, even if you couldn't do it directly with an index. Something like:

-- given a table stories(story_id int not null primary key, story varchar(max) not null)
CREATE TRIGGER prevent_plagiarism 
ON stories 
after INSERT, UPDATE 
AS 
    DECLARE @cnt AS INT 

    SELECT @cnt = Count(*) 
    FROM   stories 
           INNER JOIN inserted 
                   ON ( stories.story = inserted.story 
                        AND stories.story_id != inserted.story_id ) 

    IF @cnt > 0 
      BEGIN 
          RAISERROR('plagiarism detected',16,1) 

          ROLLBACK TRANSACTION 
      END 

Also, varchar(2048) sounds fishy to me (some things in life are 2048 bytes, but it's pretty uncommon); should it really not be varchar(max)?

查看更多
明月照影归
6楼-- · 2018-12-31 00:11
  1. Create new blank table with the same structure

  2. Execute query like this

    INSERT INTO tc_category1
    SELECT *
    FROM tc_category
    GROUP BY category_id, application_id
    HAVING count(*) > 1
    
  3. Then execute this query

    INSERT INTO tc_category1
    SELECT *
    FROM tc_category
    GROUP BY category_id, application_id
    HAVING count(*) = 1
    
查看更多
浅入江南
7楼-- · 2018-12-31 00:12

I would prefer CTE for deleting duplicate rows from sql server table

strongly recommend to follow this article ::http://codaffection.com/sql-server-article/delete-duplicate-rows-in-sql-server/

by keeping original

WITH CTE AS
(
SELECT *,ROW_NUMBER() OVER (PARTITION BY col1,col2,col3 ORDER BY col1,col2,col3) AS RN
FROM MyTable
)

DELETE FROM CTE WHERE RN<>1

without keeping original

WITH CTE AS
(SELECT *,R=RANK() OVER (ORDER BY col1,col2,col3)
FROM MyTable)
 
DELETE CTE
WHERE R IN (SELECT R FROM CTE GROUP BY R HAVING COUNT(*)>1)
查看更多
登录 后发表回答