I've seen a couple of solutions for this, but I'm wondering what the best and most efficient way is to de-dupe a table. You can use code (SQL, etc.) to illustrate your point, but I'm just looking for basic algorithms. I assumed there would already be a question about this on SO, but I wasn't able to find one, so if it already exists just give me a heads up.
(Just to clarify - I'm referring to getting rid of duplicates in a table that has an incremental automatic PK and has some rows that are duplicates in everything but the PK field.)
There are multiple ways to deduplicate a table (dedupe, remove duplication, remove repeated rows, database deduplication, removing duplicate records from a database).
If the duplicated rows are exactly the same, use GROUP BY:
create table TABLE_NAME_DEDUP as
select column1, column2, ... -- all column names
from TABLE_NAME
group by column1, column2, ... -- all column names
Then TABLE_NAME_DEDUP is the deduplicated table.
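To make the GROUP BY approach concrete, here is a minimal sketch that runs it against an in-memory SQLite database through Python's sqlite3 module. The `people` table, its columns, and the `dedupe_exact` helper are assumptions made up for illustration, not part of the answer above.

```python
import sqlite3

def dedupe_exact(conn, src, dst, columns):
    """Copy src into dst, collapsing rows that are identical in every listed column."""
    cols = ", ".join(columns)
    # Table and column names are interpolated directly, so they must be trusted input.
    conn.execute(f"CREATE TABLE {dst} AS SELECT {cols} FROM {src} GROUP BY {cols}")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, city TEXT)")  # hypothetical table
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [("Ann", "Oslo"), ("Ann", "Oslo"), ("Bob", "Rome"), ("Ann", "Rome")],
)

dedupe_exact(conn, "people", "people_dedup", ["name", "city"])
rows = sorted(conn.execute("SELECT name, city FROM people_dedup"))
print(rows)
```

Only the exact repeat of ("Ann", "Oslo") is collapsed; ("Ann", "Rome") differs in one column, so it survives.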
Another case: you have a rowid that is duplicated, but the other columns differ, so the records are only partially the same. This can happen in a transactional system while a row is being updated, where the rows that failed to update are left with nulls. To remove the duplication, keep one row per rowid:
create table test_dedup as
select column1, column2, ... (all column names)
from (
  select t.*, row_number() over (partition by rowid order by column1, column2, ... (all column names except rowid)) as cn
  from test t
) ranked
where cn = 1
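This relies on ORDER BY placing null values after non-null values, so that a row with real values is numbered first. That is the default for ascending order in Oracle and PostgreSQL; MySQL, SQL Server, and SQLite sort nulls first, so there you have to order nulls last explicitly.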
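Because the default position of nulls differs between databases, it can be safer to spell the ordering out. The sketch below does that in SQLite (3.25 or later, for window functions) by sorting on `column IS NULL` first. The column name `row_key` stands in for the duplicated rowid and, like the sample data, is an assumption for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (row_key INTEGER, column1 TEXT, column2 TEXT)")
conn.executemany(
    "INSERT INTO test VALUES (?, ?, ?)",
    [
        (1, None, None),  # failed update, left with nulls
        (1, "a", "x"),    # complete row for the same key
        (2, "b", None),
        (2, "b", "y"),
        (3, "c", "z"),
    ],
)

conn.execute(
    """
    CREATE TABLE test_dedup AS
    SELECT row_key, column1, column2
    FROM (
        SELECT t.*,
               ROW_NUMBER() OVER (
                   PARTITION BY row_key
                   -- "x IS NULL" is 0 for values and 1 for nulls, so nulls sort last
                   ORDER BY column1 IS NULL, column1, column2 IS NULL, column2
               ) AS cn
        FROM test AS t
    ) AS ranked
    WHERE cn = 1
    """
)
kept = sorted(conn.execute("SELECT row_key, column1, column2 FROM test_dedup"))
print(kept)
```

For each key, the row whose leading columns are filled in is numbered 1 and kept; the rows left with nulls are dropped.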
I am taking the one from DShook and providing a dedupe example where you would keep only the record with the highest date.
In this example say I have 3 records all with the same app_id, and I only want to keep the one with the highest date:
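A minimal sketch of that scenario, assuming a hypothetical table `app_log` with an auto-increment `id`, an `app_id`, and an ISO-formatted `entry_date` (none of these names come from the original answer). It runs on SQLite 3.25 or later via Python's sqlite3 module and may differ from the query the answer actually used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE app_log ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT, app_id INTEGER, entry_date TEXT)"
)
conn.executemany(
    "INSERT INTO app_log (app_id, entry_date) VALUES (?, ?)",
    [(7, "2020-01-01"), (7, "2021-03-15"), (7, "2020-06-01"), (8, "2019-12-31")],
)

# Rank the rows of each app_id from newest to oldest, then delete every row
# that is not ranked first.
conn.execute(
    """
    DELETE FROM app_log
    WHERE id NOT IN (
        SELECT id FROM (
            SELECT id,
                   ROW_NUMBER() OVER (
                       PARTITION BY app_id
                       ORDER BY entry_date DESC, id DESC  -- id breaks ties on equal dates
                   ) AS rn
            FROM app_log
        ) AS ranked
        WHERE rn = 1
    )
    """
)
remaining = list(conn.execute("SELECT id, app_id, entry_date FROM app_log ORDER BY id"))
print(remaining)
```

The three rows for app_id 7 collapse to the one dated 2021-03-15, and the single row for app_id 8 is untouched.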
For those of you who prefer a quick and dirty approach, just list all the columns that together define a unique record and create a unique index with those columns, like so:
ALTER IGNORE TABLE TABLE_NAME
ADD UNIQUE (column1, column2, column3);
You can drop the unique index afterwards. Note that ALTER IGNORE TABLE is MySQL-specific and was removed in MySQL 5.7.4.
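Where ALTER IGNORE TABLE is not available, a similar effect can be sketched by copying the rows into a table that already carries the unique constraint and letting the database skip conflicting rows. The example below uses SQLite's `INSERT OR IGNORE` (MySQL spells it `INSERT IGNORE`); the table name `keyed_table` and the sample data are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE unkeyed_table (column1 TEXT, column2 TEXT, column3 TEXT)")
conn.executemany(
    "INSERT INTO unkeyed_table VALUES (?, ?, ?)",
    [("a", "b", "c"), ("a", "b", "c"), ("a", "b", "d"), ("a", "b", "c")],
)

# The target table declares which columns together define a unique record.
conn.execute(
    "CREATE TABLE keyed_table ("
    " column1 TEXT, column2 TEXT, column3 TEXT,"
    " UNIQUE (column1, column2, column3))"
)

# Rows that would violate the UNIQUE constraint are silently skipped.
conn.execute(
    "INSERT OR IGNORE INTO keyed_table"
    " SELECT column1, column2, column3 FROM unkeyed_table"
)
unique_rows = sorted(conn.execute("SELECT column1, column2, column3 FROM keyed_table"))
print(unique_rows)
```

One caveat: in SQLite and MySQL a UNIQUE constraint treats NULLs as distinct from each other, so rows with a NULL in a key column are not collapsed this way.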
Here's one I've run into, in real life.
Assume you have a table of external/3rd party logins for users, and you're going to merge two users and want to dedupe on the provider/provider key values.
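One way this merge could look, as a sketch only: the `user_logins` table, its columns, and the `merge_logins` helper are hypothetical names, and the approach (delete the logins the surviving user already has, then reassign the rest) is one reasonable reading of the scenario rather than the answer's own query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_logins ("
    " id INTEGER PRIMARY KEY, user_id INTEGER, provider TEXT, provider_key TEXT)"
)
conn.executemany(
    "INSERT INTO user_logins (user_id, provider, provider_key) VALUES (?, ?, ?)",
    [
        (1, "google", "g-123"),
        (2, "google", "g-123"),  # same login already held by user 1
        (2, "github", "h-456"),  # only user 2 has this one
    ],
)

def merge_logins(conn, keep_user, drop_user):
    """Move drop_user's logins to keep_user, skipping ones keep_user already has."""
    params = {"keep": keep_user, "drop": drop_user}
    with conn:  # both statements commit together or roll back together
        # Step 1: remove logins that would become duplicates after the merge.
        conn.execute(
            """
            DELETE FROM user_logins
            WHERE user_id = :drop
              AND EXISTS (
                  SELECT 1 FROM user_logins AS k
                  WHERE k.user_id = :keep
                    AND k.provider = user_logins.provider
                    AND k.provider_key = user_logins.provider_key
              )
            """,
            params,
        )
        # Step 2: hand the remaining logins over to the surviving user.
        conn.execute(
            "UPDATE user_logins SET user_id = :keep WHERE user_id = :drop", params
        )

merge_logins(conn, keep_user=1, drop_user=2)
merged = list(
    conn.execute(
        "SELECT user_id, provider, provider_key FROM user_logins ORDER BY provider"
    )
)
print(merged)
```

Deleting before updating avoids ever holding two identical (provider, provider key) rows for the same user, which matters if that pair is covered by a unique index.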
This can dedupe the duplicated values in c1.
For SQL, you may use INSERT IGNORE INTO table SELECT xy FROM unkeyed_table;
For an algorithm: if you can assume that to-be-primary keys may be repeated, but each to-be-primary key uniquely identifies the content of its row, then hash only the to-be-primary key and check for repetition.
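The hashing idea can be sketched in a few lines of Python, where a `set` does the hashing and the repetition check. The `dedupe_by_key` helper and the sample rows are assumptions for illustration; the sketch also assumes, as stated above, that rows sharing a key have identical content, so it does not matter which copy is kept.

```python
def dedupe_by_key(rows, key):
    """Yield the first row seen for each key; later rows with a seen key are dropped."""
    seen = set()  # a set hashes each key, so membership checks are O(1) on average
    for row in rows:
        k = key(row)
        if k not in seen:
            seen.add(k)
            yield row

rows = [
    ("u1", "Ann", "Oslo"),
    ("u2", "Bob", "Rome"),
    ("u1", "Ann", "Oslo"),
    ("u3", "Cy", "Lima"),
    ("u2", "Bob", "Rome"),
]
# Only the first field (the to-be-primary key) is hashed, not the whole row.
unique = list(dedupe_by_key(rows, key=lambda r: r[0]))
print(unique)
```

This makes a single pass over the input and keeps one hash entry per distinct key, so memory grows with the number of distinct keys rather than with the row width.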