I've got a big (>1 million rows) MySQL database messed up by duplicates. I think somewhere between 1/4 and 1/2 of the whole db is filled with them.
I need to get rid of them quickly (I mean query execution time).
Here's how it looks:
id (index) | text1 | text2 | text3
The text1 & text2 combination should be unique, and
if there are any duplicates, only one combination with text3 NOT NULL should remain. Example:
1 | abc | def | NULL
2 | abc | def | ghi
3 | abc | def | jkl
4 | aaa | bbb | NULL
5 | aaa | bbb | NULL
...becomes:
1 | abc | def | ghi #(doesn't really matter whether id 2 or id 3 survives)
2 | aaa | bbb | NULL #(if there's no NOT NULL text3, NULL will do)
New ids could be anything; they do not depend on the old table's ids.
I've tried things like:
CREATE TABLE tmp SELECT text1, text2, text3
FROM my_tbl
GROUP BY text1, text2;
DROP TABLE my_tbl;
ALTER TABLE tmp RENAME TO my_tbl;
Or SELECT DISTINCT and other variations.
While they work on small databases, the query execution time on mine is just huge (I never actually got to the end; > 20 min).
Is there any faster way to do that? Please help me solve this problem.
For large tables with few duplicates, you may want to avoid copying the whole table to another place. One way is to create a temporary table holding the rows you want to keep (for each key with duplicates), and then delete duplicates from the original table.
An example is given here.
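The linked example is not quoted above. A sketch of the delete-in-place idea, assuming the schema from the question (the temporary table name `keep` and the `keep_id` column are made up for illustration), might look like:

```sql
-- For each (text1, text2) key that has duplicates, record the id of one
-- row to keep, preferring a row whose text3 is NOT NULL.
CREATE TEMPORARY TABLE keep AS
SELECT text1, text2,
       COALESCE(MIN(CASE WHEN text3 IS NOT NULL THEN id END), MIN(id)) AS keep_id
FROM my_tbl
GROUP BY text1, text2
HAVING COUNT(*) > 1;

-- Delete every duplicate row except the chosen survivor.
DELETE m
FROM my_tbl m
JOIN keep k ON m.text1 = k.text1 AND m.text2 = k.text2
WHERE m.id <> k.keep_id;
```

This only touches keys that actually have duplicates, which is the point when most of the table is clean; both statements benefit from an index on (text1, text2).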
You can remove all the duplicate entries by using this simple query; it will select all the duplicate records and remove them.
Found this simple one-line query to do exactly what I needed:
Taken from: http://mediakey.dk/~cc/mysql-remove-duplicate-entries/
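The one-liner itself is not quoted above. Self-join deletes of this shape are a common pattern for this task; the version below is a hedged reconstruction using the question's schema, not necessarily the exact query from the link (note it keeps the lowest id per pair and does not prefer a NOT NULL text3):

```sql
-- Delete any row for which a row with the same (text1, text2) and a
-- smaller id exists; one row per pair survives.
DELETE FROM my_tbl
USING my_tbl, my_tbl AS vtable
WHERE my_tbl.id > vtable.id
  AND my_tbl.text1 = vtable.text1
  AND my_tbl.text2 = vtable.text2;
```

Without an index on (text1, text2) this self-join can be very slow on a million-row table.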
remove duplicates without removing foreign keys
If you can create a new table, do so with a unique key on the text1 + text2 fields. Then insert into the table ignoring errors (using the INSERT IGNORE syntax):
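A sketch of that approach, assuming the question's schema and an AUTO_INCREMENT id; the table name `my_tbl_clean`, key name `uq_t1_t2`, and the prefix lengths are assumptions (prefixes are required if text1/text2 are TEXT columns):

```sql
CREATE TABLE my_tbl_clean LIKE my_tbl;
ALTER TABLE my_tbl_clean ADD UNIQUE KEY uq_t1_t2 (text1(191), text2(191));

-- Insert rows with a NOT NULL text3 first, so they win the unique-key
-- conflict; later duplicates are silently dropped by IGNORE.
INSERT IGNORE INTO my_tbl_clean (text1, text2, text3)
SELECT text1, text2, text3
FROM my_tbl
ORDER BY (text3 IS NULL);
```

After verifying the result, the tables can be swapped with DROP TABLE and RENAME TABLE as in the question's own attempt.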
Indexes on all those columns could help a lot, but creating them now could be pretty slow.
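For reference, a composite index on the lookup key would look like the following; the index name and prefix lengths are assumptions (prefixes are needed if these are TEXT columns):

```sql
ALTER TABLE my_tbl ADD INDEX idx_t1_t2 (text1(100), text2(100));
```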
I know this is an old thread, but I have a somewhat messy method that is much faster and customizable; in terms of speed, I'd say 10 seconds instead of 100 seconds (10:1).
My method does require all that messy stuff you were trying to avoid:
But when you are talking about MILLIONS (or in my case Tens of Millions) it's worth it.
Anyway, it's not much; the comments are in Portuguese, but here is my sample:
EDIT: if I get comments I'll explain further how it works :)
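The sample itself is not reproduced above. A hypothetical sketch of the general drop-and-swap pattern it describes, using the question's schema (the name `my_tbl_dedup` is made up), could be:

```sql
-- Aggregate into a fresh table: MAX() skips NULLs, so a NOT NULL text3
-- survives whenever one exists for the (text1, text2) pair.
CREATE TABLE my_tbl_dedup AS
SELECT MIN(id) AS id, text1, text2, MAX(text3) AS text3
FROM my_tbl
GROUP BY text1, text2;

-- Atomic swap, then drop the old table.
RENAME TABLE my_tbl TO my_tbl_old, my_tbl_dedup TO my_tbl;
DROP TABLE my_tbl_old;
```

Any indexes and constraints would have to be recreated on the new table, since CREATE TABLE ... SELECT does not copy them.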