MySQL remove duplicates from big database quick

2019-01-03 21:17发布

I've got big (>Mil rows) MySQL database messed up by duplicates. I think it could be from 1/4 to 1/2 of the whole db filled with them. I need to get rid of them quick (i mean query execution time). Here's how it looks:
id (index) | text1 | text2 | text3
text1 & text2 combination should be unique, if there are any duplicates, only one combination with text3 NOT NULL should remain. Example:

1 | abc | def | NULL  
2 | abc | def | ghi  
3 | abc | def | jkl  
4 | aaa | bbb | NULL  
5 | aaa | bbb | NULL  

...becomes:

1 | abc | def | ghi   #(doesn't realy matter id:2 or id:3 survives)   
2 | aaa | bbb | NULL  #(if there's no NOT NULL text3, NULL will do)

New ids cold be anything, they do not depend on old table ids.
I've tried things like:

CREATE TABLE tmp SELECT text1, text2, text3
FROM my_tbl;
GROUP BY text1, text2;
DROP TABLE my_tbl;
ALTER TABLE tmp RENAME TO my_tbl;

Or SELECT DISTINCT and other variations.
While they work on small databases, query execution time on mine is just huge (never got to the end, actually; > 20 min)

Is there any faster way to do that? Please help me solve this problem.

9条回答
乱世女痞
2楼-- · 2019-01-03 21:44

For large tables with few duplicates, you may want to avoid copying the whole table to another place. One way is to create a temporary table holding the rows you want to keep (for each key with duplicates), and then delete duplicates from the original table.

An example is given here.

查看更多
三岁会撩人
3楼-- · 2019-01-03 21:45

you can remove all the duplicate entries by using this simple query. that will select all the duplicate records and remove them.

 DELETE i1 
FROM TABLE i1
LEFT JOIN TABLE i2
  ON i1.id = i2.id
 AND i1.colo = i2.customer_invoice_id
 AND i1.id < i2.id
WHERE i2.customer_invoice_id IS NOT NULL
查看更多
女痞
4楼-- · 2019-01-03 21:47

Found this simple 1-line code to do exactly what I needed:

ALTER IGNORE TABLE dupTest ADD UNIQUE INDEX(a,b);

Taken from: http://mediakey.dk/~cc/mysql-remove-duplicate-entries/

查看更多
爱情/是我丢掉的垃圾
5楼-- · 2019-01-03 21:48

remove duplicates without removing foreign keys

create table tmp like mytable;
ALTER TABLE tmp ADD UNIQUE INDEX(text1, text2, text3, text4, text5, text6);
insert IGNORE into tmp select * from mytable;
delete from mytable where id not in ( select id from tmp);
查看更多
可以哭但决不认输i
6楼-- · 2019-01-03 21:53

If you can create a new table, do so with a unique key on the text1 + text2 fields. Then insert into the table ignoring errors (using the INSERT IGNORE syntax):

select * from my_tbl order by text3 desc
  • I think the order by text3 desc will put the NULLs last, but double check that.

Indexes on all those columns could help a lot, but creating them now could be pretty slow.

查看更多
Viruses.
7楼-- · 2019-01-03 21:53

I know this is an Old thread but I have a somewhat messy method that is much faster and customizable, in terms of speed I'd say 10sec instead of 100sec (10:1).

My method does required all that messy stuff you were trying to avoid:

  • Group by (and Having)
  • group concat with ORDER BY
  • 2 temporary tables
  • using files on disk!
  • somehow (php?) deleting the file after

But when you are talking about MILLIONS (or in my case Tens of Millions) it's worth it.

anyway its not much because comment are in portuguese but here is my sample:

EDIT: if I get comments I'll explain further how it works :)

START TRANSACTION;

DROP temporary table if exists to_delete;

CREATE temporary table to_delete as (
    SELECT
        -- escolhe todos os IDs duplicados menos os que ficam na BD
        -- A ordem de escolha dos IDs é dada por "ORDER BY campo_ordenacao DESC" em que o primeiro é o que fica
        right(
            group_concat(id ORDER BY campos_ordenacao DESC SEPARATOR ','),
            length(group_concat(id ORDER BY campos_ordenacao DESC SEPARATOR ',')) 
                - locate(",",group_concat(id ORDER BY campos_ordenacao DESC SEPARATOR ','))
        ) as ids,

        count(*) as c

    -- Tabela a eliminar duplicados
    FROM teste_dup

    -- campos a usar para identificar  duplicados
    group by test_campo1, test_campo2, teste_campoN
    having count(*) > 1 -- é duplicado
);

-- aumenta o limite desta variável de sistema para o máx 
SET SESSION group_concat_max_len=4294967295;

-- envia os ids todos a eliminar para um ficheiro
select group_concat(ids SEPARATOR ',') from to_delete INTO OUTFILE 'sql.dat';

DROP temporary table if exists del3;
create temporary table del3 as (select CAST(1 as signed) as ix LIMIT 0);

-- insere os ids a eliminar numa tabela temporaria a partir do ficheiro
load data infile 'sql.dat' INTO TABLE del3
LINES TERMINATED BY ',';

alter table del3 add index(ix);

-- elimina os ids seleccionados
DELETE teste_dup -- tabela 
from teste_dup -- tabela

join del3 on id=ix;

COMMIT;
查看更多
登录 后发表回答