Remove duplicate rows from table with join

2019-02-18 10:40发布

问题:

I have two table to contain state (state_table) and city (city_table) of countries

The city table is having state_id to relate it with state_table

Both the tables are already having data in it.

Now the problem

City table contains multiple entries of a city within one state. And another cities may or may not have the same city name as well

e.g.: cityone will have 5 occurrence in the city table with stateone and 2 occurrence with statetwo

So how will I write a query to keep one city for each state and delete the rest?

Schema follows

CREATE TABLE IF NOT EXISTS `city_table` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `state_id` int(11) NOT NULL,
  `city` varchar(25) NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=1 ;


CREATE TABLE IF NOT EXISTS `state_table` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `state` varchar(15) NOT NULL,
  `country_id` smallint(5) NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=1 ;

This is the sample data

id   state_id   city
1   1   city_one
2   1   city_two
3   1   city_one
4   1   city_two
5   2   city_one
6   3   city_three
7   3   city_one
8   3   city_three
9   4   city_four
10  4   city_five

Original table has 152,451 rows

回答1:

If you want to remove duplicate city with same state_id (duplicate records), you can do that by grouping them by city and state_id and using MIN or MAX function:

Before delete query your table was looking like

| ID | STATE_ID |       CITY |
------------------------------
|  1 |        1 |   city_one |
|  2 |        1 |   city_two |
|  3 |        1 |   city_one |
|  4 |        1 |   city_two |
|  5 |        2 |   city_one |
|  6 |        3 | city_three |
|  7 |        3 |   city_one |
|  8 |        3 | city_three |
|  9 |        4 |  city_four |
| 10 |        4 |  city_five |

You can use the following query to remove duplicate records:

DELETE city_table 
  FROM city_table
  LEFT JOIN 
  (SELECT MIN(id) AS IDs FROM city_table
   GROUP BY city,state_id
  )A
  ON city_table.ID = A.IDs
  WHERE A.ids IS NULL;

After applying the above query your table will look like:

| ID | STATE_ID |       CITY |
------------------------------
|  1 |        1 |   city_one |
|  2 |        1 |   city_two |
|  5 |        2 |   city_one |
|  6 |        3 | city_three |
|  7 |        3 |   city_one |
|  9 |        4 |  city_four |
| 10 |        4 |  city_five |

See this SQLFiddle

For more see DELETE Syntax of MySQL.



回答2:

DELETE FROM city_table 
WHERE id NOT IN 
  (SELECT MIN(id) 
   FROM city_table 
   GROUP BY state_id, city)

If you'll find this query too slow, you can create temporary table, and store output of subquery in it, then truncate original table and refill it's contents. It is a bit dirty solution, as you would have to set auto_increment column values.