MySQL remove duplicates from big database quick

2019-01-03 21:17发布

I've got big (>Mil rows) MySQL database messed up by duplicates. I think it could be from 1/4 to 1/2 of the whole db filled with them. I need to get rid of them quick (i mean query execution time). Here's how it looks:
id (index) | text1 | text2 | text3
text1 & text2 combination should be unique, if there are any duplicates, only one combination with text3 NOT NULL should remain. Example:

1 | abc | def | NULL  
2 | abc | def | ghi  
3 | abc | def | jkl  
4 | aaa | bbb | NULL  
5 | aaa | bbb | NULL  

...becomes:

1 | abc | def | ghi   #(doesn't realy matter id:2 or id:3 survives)   
2 | aaa | bbb | NULL  #(if there's no NOT NULL text3, NULL will do)

New ids cold be anything, they do not depend on old table ids.
I've tried things like:

CREATE TABLE tmp SELECT text1, text2, text3
FROM my_tbl;
GROUP BY text1, text2;
DROP TABLE my_tbl;
ALTER TABLE tmp RENAME TO my_tbl;

Or SELECT DISTINCT and other variations.
While they work on small databases, query execution time on mine is just huge (never got to the end, actually; > 20 min)

Is there any faster way to do that? Please help me solve this problem.

9条回答
淡お忘
2楼-- · 2019-01-03 22:04
DELETE FROM dups
WHERE id NOT IN(
    SELECT id FROM (
        SELECT DISTINCT id, text1, text2
            FROM dups
        GROUP BY text1, text2
        ORDER BY text3 DESC
    ) as tmp
)

This queries all records, groups by the distinction fields and orders by ID (means we pick the first not null text3 record). Then we select the id's from that result (these are good ids...they wont be deleted) and delete all IDs that AREN'T those.

Any query like this affecting the entire table will be slow. You just need to run it and let it roll out so you can prevent it in the future.

After you have done this "fix" I would apply UNIQUE INDEX (text1, text2) to that table. To prevent the posibility of duplicates in the future.

If you want to go the "create a new table and replace the old one" route. You could use the very inner select statement to create your insert statement.

MySQL specific (assumes new table is named my_tbl2 and has exactly the same structure):

INSERT INTO my_tbl2
    SELECT DISTINCT id, text1, text2, text3
            FROM dups
        GROUP BY text1, text2
        ORDER BY text3 DESC

See MySQL INSERT ... SELECT for more information.

查看更多
祖国的老花朵
3楼-- · 2019-01-03 22:04

I don't have much experience with MySQL. If it has analytic functions try:

delete from my_tbl
 where id in (
     select id 
       from (select id, row_number()
                            over (partition by text1, text2 order by text3 desc) as rn
               from my_tbl
               /* optional: where text1 like 'a%'  */
             ) as t2
       where rn > 1
     )

the optional where clause makes the means you'll have to run it multiple times, one for each letter, etc. Create an index on text1?

Before running this, confirm that "text desc" will sort nulls last in MySQL.

查看更多
萌系小妹纸
4楼-- · 2019-01-03 22:06

I believe this will do it, using on duplicate key + ifnull():

create table tmp like yourtable;

alter table tmp add unique (text1, text2);

insert into tmp select * from yourtable 
    on duplicate key update text3=ifnull(text3, values(text3));

rename table yourtable to deleteme, tmp to yourtable;

drop table deleteme;

Should be much faster than anything that requires group by or distinct or a subquery, or even order by. This doesn't even require a filesort, which is going to kill performance on a large temporary table. Will still require a full scan over the original table, but there's no avoiding that.

查看更多
登录 后发表回答