I've seen a couple of solutions for this, but I'm wondering what the best and most efficient way is to de-dupe a table. You can use code (SQL, etc.) to illustrate your point, but I'm just looking for basic algorithms. I assumed there would already be a question about this on SO, but I wasn't able to find one, so if it already exists just give me a heads up.
(Just to clarify - I'm referring to getting rid of duplicates in a table that has an incremental automatic PK and has some rows that are duplicates in everything but the PK field.)
SELECT DISTINCT <insert all columns but the PK here> FROM foo
. Create a temp table using that query (the syntax varies by RDBMS but there's typically aSELECT … INTO
orCREATE TABLE AS
pattern available), then blow away the old table and pump the data from the temp table back into it.You could generate a hash for each row (excluding the PK), store it in a new column (or if you can't add new columns, can you move the table to a temp staging area?), and then look for all other rows with the same hash. Of course, you would have to be able to ensure that your hash function doesn't produce the same code for different rows.
If two rows are duplicate, does it matter which you get rid of? Is it possible that other data are dependent on both of the duplicates? If so, you will have to go through a few steps:
dupeA
to eliminatedupeA
dupeB
dupeA
.This could be easy or complicated, depending on your existing data model.
This whole scenario sounds like a maintenance and redesign project. If so, best of luck!!