Deleting Duplicate Rows from a MySQL Table

Posted 2020-03-28 22:35

I have a script that finds duplicate rows in my MySQL table. The table contains 40,000,000 rows, but the script is very slow. Is there an easier way to find the duplicate records without going in and out of PHP?

This is the script I currently use:

$find = mysql_query("SELECT * FROM pst_nw WHERE ID < 1000");
while ($row = mysql_fetch_assoc($find)) {
    // Find any OTHER row sharing all four address fields
    // (without the ID <> check the row would match itself,
    // and every row under 1000 would get deleted)
    $find_1 = mysql_query("SELECT * FROM pst_nw
        WHERE add1 = '$row[add1]' AND add2 = '$row[add2]'
          AND add3 = '$row[add3]' AND add4 = '$row[add4]'
          AND ID <> '$row[ID]'");
    if (mysql_num_rows($find_1) > 0) {
        mysql_query("DELETE FROM pst_nw WHERE ID = '$row[ID]'");
    }
}

Tags: php mysql
6 Answers
戒情不戒烟
Answer #2 · 2020-03-28 22:42

You have a number of options.

Let the DB do the work

Create a copy of your table with a unique index - and then insert the data into it from your source table:

CREATE TABLE clean LIKE pst_nw;
ALTER IGNORE TABLE clean ADD UNIQUE INDEX (add1, add2, add3, add4);
INSERT IGNORE INTO clean SELECT * FROM pst_nw;
DROP TABLE pst_nw;
RENAME TABLE clean TO pst_nw;

The advantage of doing things this way is that you can verify that your new table is correct before dropping your source table. The disadvantage is that it takes up twice as much space and is (relatively) slow to execute.
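For example, a quick sanity check before the DROP might look like this (a sketch; it assumes duplicates are defined by the four address columns, as above):

SELECT COUNT(*) FROM clean;
SELECT COUNT(DISTINCT add1, add2, add3, add4) FROM pst_nw;

The two counts should match, barring NULLs in the address columns (COUNT(DISTINCT ...) skips rows containing NULL, while a UNIQUE index does not collapse them).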

Let the DB do the work #2

You can also achieve the result you want by doing:

set session old_alter_table=1;
ALTER IGNORE TABLE pst_nw ADD UNIQUE INDEX (add1, add2, add3, add4);

The first command is required as a workaround for the IGNORE flag otherwise being... ignored. (Note: ALTER IGNORE TABLE was removed in MySQL 5.7, so this approach only works on 5.6 and earlier.)

The advantage here is there's no messing about with a temporary table - the disadvantage is you don't get to check that your update does exactly what you expect before you run it.

Example:

 CREATE TABLE `foo` (
  `id` int(10) NOT NULL AUTO_INCREMENT,
  `one` int(10) DEFAULT NULL,
  `two` int(10) DEFAULT NULL,
  PRIMARY KEY (`id`)
);

insert into foo values (null, 1, 1);
insert into foo values (null, 1, 1);
insert into foo values (null, 1, 1);

select * from foo;
+----+------+------+
| id | one  | two  |
+----+------+------+
|  1 |    1 |    1 |
|  2 |    1 |    1 |
|  3 |    1 |    1 |
+----+------+------+
3 rows in set (0.00 sec)

set session old_alter_table=1;
ALTER IGNORE TABLE foo ADD UNIQUE INDEX (one, two);

select * from foo;
+----+------+------+
| id | one  | two  |
+----+------+------+
|  1 |    1 |    1 |
+----+------+------+
1 row in set (0.00 sec)

Don't do this kind of thing outside the DB

Especially with 40 million rows, doing something like this outside the DB is likely to take a huge amount of time and may not complete at all. Any solution that stays in the DB will be faster and more robust.

啃猪蹄的小仙女
Answer #3 · 2020-03-28 22:42

Sure there is. Note, however, that with 40 million records you will most probably exceed the max PHP execution time. Try the following:

CREATE TABLE temp_pst_nw LIKE pst_nw;
INSERT INTO temp_pst_nw SELECT * FROM pst_nw GROUP BY add1, add2, add3, add4;

Confirm that everything is ok first!!

DROP TABLE pst_nw;
RENAME TABLE temp_pst_nw TO pst_nw;
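One caveat not in the original answer: on MySQL 5.7 and later, the default ONLY_FULL_GROUP_BY mode rejects SELECT * ... GROUP BY over a subset of columns, so the INSERT above may need the session mode relaxed first:

SET SESSION sql_mode = REPLACE(@@sql_mode, 'ONLY_FULL_GROUP_BY', '');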
Summer. ? 凉城
Answer #4 · 2020-03-28 22:43

Your code will be better if you don't use SELECT *; select only the columns (the four address fields) you want to compare. Your SQL should also have a LIMIT clause, so the script does not stop responding when you work with that many rows.
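A sketch of that change (the batch size is illustrative, not from the original answer):

SELECT ID, add1, add2, add3, add4
FROM pst_nw
WHERE ID < 1000
LIMIT 500;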

霸刀☆藐视天下
Answer #5 · 2020-03-28 22:45

Usually in questions like this the problem is "I have duplicate rows, want to keep only one row, any one".

But judging from the code, what you want is: "if a set of add1, add2, add3, add4 is duplicated, DELETE ALL COPIES WITH ID < 1000". In this case, copying from the table to another with INSERT IGNORE won't do what you want - it might even keep rows with lower IDs and discard subsequent ones.

I believe you need to run something like the following to gather all the "bad" IDs (IDs below 1000 that have a duplicate). In this code I used AND bad.ID < good.ID, so if ID 777 duplicates ID 888, ID 777 will still get deleted. If that is not what you want, you can change the condition to something like AND bad.ID < 1000 AND good.ID > 1000.

CREATE TABLE bad_ids AS
    SELECT bad.ID FROM pst_nw AS bad JOIN pst_nw AS good
    ON ( bad.ID < 1000 AND bad.ID < good.ID
       AND bad.add1 = good.add1
       AND bad.add2 = good.add2
       AND bad.add3 = good.add3
       AND bad.add4 = good.add4 );

Then, once you have all the bad IDs in a table:

DELETE pst_nw.* FROM pst_nw JOIN bad_ids ON (pst_nw.ID = bad_ids.ID);

Performance will benefit greatly from a (non-unique, possibly only temporary) index on add1, add2, add3, add4, and ID, in that order.
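A minimal sketch (the index name is illustrative):

ALTER TABLE pst_nw ADD INDEX idx_addr_dedup (add1, add2, add3, add4, ID);

The index can be dropped again once the cleanup is done.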

Rolldiameter
Answer #6 · 2020-03-28 23:00

Try creating a new table that has the same definition, i.e. "my_table_two", then do:

CREATE TABLE my_table_two LIKE my_table;
INSERT INTO my_table_two SELECT DISTINCT unique_col1, col2, col3 [...] FROM my_table;

Maybe that'll sort it out.

成全新的幸福
Answer #7 · 2020-03-28 23:04

Get the duplicate rows using the GROUP BY operator. Here is a sample you can try:

SELECT id
FROM table
GROUP BY matching_field1, matching_field2, ...
HAVING COUNT(id) > 1;

So you get all the duplicate ids. Now delete them using a DELETE query. Instead of using IN, use the OR operator, as IN is slow compared to OR.
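For illustration, with placeholder ids standing in for values returned by the query above:

DELETE FROM table WHERE id = 101 OR id = 205 OR id = 319;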
