How do I delete duplicate records, or merge them w

2019-08-11 14:35发布

问题:

I have a database generated from an XML document with duplicate records. I know how to delete one record from the main table, but not those with foreign-key restraints.

I have a large amount of XML documents, and they are inserted without caring about duplicates or not. One solution to removing duplicates is to just delete the lowest Primary_Key values (and all related foreign key records) and keep the highest. I don't know how to do that, though.

The database looks like this:

Table 1: [type]

+-------------+---------+-----------+
| Primary_Key | Food_ID | Food_Type |
+-------------+---------+-----------+
|   70001     |  12345  |  fruit    |
|   70002     |  12345  |  fruit    |
|   70003     |  12345  |  meat     |
+----^--------+---------+-----------+
     |
     |-----------------|
                       |    
                       | Linked to primary key in the first table
+-------------+--------v--------+-------------+-------------+------------+
| Primary_Key |  Information_ID |   Food_Name | Information |  Comments  | 
+-------------+-----------------+-------------+-------------+------------+
|   0001      |     70001       |   banana    |  buy @ toms | delicious! |
|   0002      |     70002       |   banana    |  buy @ mats | so-so      |
|   0003      |     70003       | decade meat |  buy @ sals | disgusting |
+-------------+-----------------+-------------+-------------+------------+

^ Table 2: [food_information]

There are several other linked tables as well, which all have a foreign key value of the matched primary key value in the main table ([type]).

My question based on which solution might be best:

  1. How do I delete all of those records, except 70003 (the highest one)? We can't know if it's duplicate record unless [Food_ID] shows up more than once. If it shows up more than once, we need to delete records from ALL tables (there are 10) based on the Primary_Key and Foreign_Key relationship.
  2. How do I update/merge these SQL records on insertion to avoid having to delete multiples again?

I'd prefer #1, as it prevents me from having to rebuild the database, and it makes inserting much easier.

Thanks!

回答1:

Even if a [foodID] is not duplicated you will get a max(Primary_Key)
And it will not be deleted
The where condition is NOT in

delete tableX 
 where tableX.informationID not in ( select max(Primary_Key) 
                                       from [type] 
                                   group by [foodID] )


then just do [type] last


delete [type] 
 where [type].[Primary_Key] not in ( select max(Primary_Key) 
                                       from [type] 
                                      group by [foodID] )

then just create as unique constraint on [foodID]



回答2:

something like...

assumed:

create table food (
  primary_key int,
  food_id int,
  food_type varchar(20)
  );

insert into food values (70001,12345,'fruit'); 
insert into food values (70002,12345,'fruit'); 
insert into food values (70003,12345,'meat'); 
insert into food values (70004,11111,'taco'); 

create table info (
  primary_key int,
  info_id int,
  food_name varchar(20)
  );

insert into info values (1,70001,'banana'); 
insert into info values (2,70002,'banana'); 
insert into info values (3,70003,'decade meat'); 
insert into info values (4,70004,'taco taco'); 

and then...

-- yields:   12345   70003 

select  food_id, max(info_id) as max_info_id
from    food 
        join info on food.primary_key=info.info_id
where   food_id in ( 
            select food_id
            from   food 
                   join info on food.primary_key=info.info_id
            group  by food_id
            having count(*)>1);  

then... something like... this to get the ones to delete. there might be a better way to write this... i'm thinking about it.

select  *
from    food 
        join info on food.primary_key=info.info_id 
        join ( select  food_id, max(info_id) as max_info_id
               from    food 
                       join info on food.primary_key=info.info_id
               where   food_id in ( 
                           select food_id
                           from   food 
                                  join info on food.primary_key=info.info_id
                           group  by food_id
                           having count(*)>1) 
               ) as dont_delete
            on food.food_id=dont_delete.food_id and 
               info.info_id<max_info_id  

gives you:

PRIMARY_KEY FOOD_ID FOOD_TYPE   INFO_ID FOOD_NAME   MAX_INFO_ID
70001       12345   fruit       70001   banana      70003
70002       12345   fruit       70002   banana      70003

so you could do.... just delete from food where primary_key in (select food.primary_key from that_big_query_up_there) and delete from info where info_id in (select food.primary_key from that_big_query_up_there)

for future issues, maybe consider a unique constraint on food... unique(primary_key,food_id) or something but if it's one-to-one, why don't you just store them together...?



标签: sql xml tsql