What's the best way to dedupe a table?

2019-01-21 19:33发布

I've seen a couple of solutions for this, but I'm wondering what the best and most efficient way is to de-dupe a table. You can use code (SQL, etc.) to illustrate your point, but I'm just looking for basic algorithms. I assumed there would already be a question about this on SO, but I wasn't able to find one, so if it already exists just give me a heads up.

(Just to clarify - I'm referring to getting rid of duplicates in a table that has an incremental automatic PK and has some rows that are duplicates in everything but the PK field.)

14条回答
相关推荐>>
2楼-- · 2019-01-21 20:10

Using analytic function row_number:

WITH CTE (col1, col2, dupcnt)
AS
(
SELECT col1, col2,
ROW_NUMBER() OVER (PARTITION BY col1, col2 ORDER BY col1) AS dupcnt
FROM Youtable
)
DELETE
FROM CTE
WHERE dupcnt > 1
GO                                                                 
查看更多
唯我独甜
3楼-- · 2019-01-21 20:10

These methods will work, but without an explicit id as a PK then determining which rows to delete could be a problem. The bounce out into a temp table delete from original and re-insert without the dupes seems to be the simplest.

查看更多
走好不送
4楼-- · 2019-01-21 20:12

Deduping is rarely simple. That's because the records to be dedupped often have slightly different values is some of the fields. Therefore choose which record to keep can be problematic. Further, dups are often people records and it is hard to identify if the two John Smith's are two people or one person who is duplicated. So spend a lot (50% or more of the whole project) of your time defining what constitutes a dup and how to handle the differences and child records.

How do you know which is the correct value? Further dedupping requires that you handle all child records not orphaning any. What happens when you find that by changing the id on the child record you are suddenly violating one of the unique indexes or constraints - this will happen eventually and your process needs to handle it. If you have chosen foolishly to apply all your constraints only thorough the application, you may not even know the constraints are violated. When you have 10,000 records to dedup, you aren't going to go through the application to dedup one at a time. If the constraint isn't in the database, lots of luck in maintaining data integrity when you dedup.

A further complication is that dups don't always match exactly on the name or address. For instance a salesrep named Joan Martin may be a dup of a sales rep names Joan Martin-Jones especially if they have the same address and email. OR you could have John or Johnny in the name. Or the same street address except one record abbreveiated ST. and one spelled out Street. In SQL server you can use SSIS and fuzzy grouping to also identify near matches. These are often the most common dups as the fact that weren't exact matches is why they got put in as dups in the first place.

For some types of dedupping, you may need a user interface, so that the person doing the dedupping can choose which of two values to use for a particular field. This is especially true if the person who is being dedupped is in two or more roles. It could be that the data for a particular role is usually better than the data for another role. Or it could be that only the users will know for sure which is the correct value or they may need to contact people to find out if they are genuinely dups or simply two people with the same name.

查看更多
手持菜刀,她持情操
5楼-- · 2019-01-21 20:12

Here's the method I use if you can get your dupe criteria into a group by statement and your table has an id identity column for uniqueness:

delete t
from tablename t
inner join  
(
    select date_time, min(id) as min_id
    from tablename
    group by date_time
    having count(*) > 1
) t2 on t.date_time = t2.date_time
where t.id > t2.min_id

In this example the date_time is the grouping criteria, if you have more than one column make sure to join on all of them.

查看更多
女痞
6楼-- · 2019-01-21 20:14

Adding the actual code here for future reference

So, there are 3 steps, and therefore 3 SQL statements:

Step 1: Move the non duplicates (unique tuples) into a temporary table

CREATE TABLE new_table as
SELECT * FROM old_table WHERE 1 GROUP BY [column to remove duplicates by];

Step 2: delete the old table (or rename it) We no longer need the table with all the duplicate entries, so drop it!

DROP TABLE old_table;

Step 3: rename the new_table to the name of the old_table

RENAME TABLE new_table TO old_table;

And of course, don't forget to fix your buggy code to stop inserting duplicates!

查看更多
ら.Afraid
7楼-- · 2019-01-21 20:15

I think this should require nothing more then just grouping by all columns except the id and choosing one row from every group - for simplicity just the first row, but this does not actually matter besides you have additional constraints on the id.

Or the other way around to get rid of the rows ... just delete all rows accept a single one from all groups.

查看更多
登录 后发表回答