What's the best way to dedupe a table?

2019-01-21 19:33发布

I've seen a couple of solutions for this, but I'm wondering what the best and most efficient way is to de-dupe a table. You can use code (SQL, etc.) to illustrate your point, but I'm just looking for basic algorithms. I assumed there would already be a question about this on SO, but I wasn't able to find one, so if it already exists just give me a heads up.

(Just to clarify - I'm referring to getting rid of duplicates in a table that has an incremental automatic PK and has some rows that are duplicates in everything but the PK field.)

14条回答
神经病院院长
2楼-- · 2019-01-21 20:17

SELECT DISTINCT <insert all columns but the PK here> FROM foo. Create a temp table using that query (the syntax varies by RDBMS but there's typically a SELECT … INTO or CREATE TABLE AS pattern available), then blow away the old table and pump the data from the temp table back into it.

查看更多
倾城 Initia
3楼-- · 2019-01-21 20:17

You could generate a hash for each row (excluding the PK), store it in a new column (or if you can't add new columns, can you move the table to a temp staging area?), and then look for all other rows with the same hash. Of course, you would have to be able to ensure that your hash function doesn't produce the same code for different rows.

If two rows are duplicate, does it matter which you get rid of? Is it possible that other data are dependent on both of the duplicates? If so, you will have to go through a few steps:

  • Find the dupes
  • Choose one of them as dupeA to eliminate
  • Find all data dependent on dupeA
  • Alter that data to refer to dupeB
  • delete dupeA.

This could be easy or complicated, depending on your existing data model.

This whole scenario sounds like a maintenance and redesign project. If so, best of luck!!

查看更多
登录 后发表回答