Effective way to delete duplicate rows from millio

2019-09-11 17:31发布

问题:

I am looking to find an effective way to delete duplicated records from my database. First, I used a stored procedure that uses joins and such, which caused the query to execute very slow. Now, I am trying a different approach. Please consider the following queries:

/* QUERY A */

SELECT *
FROM my_table
WHERE col1 = value
  AND col2 = value
  AND col3 = value

This query just executed in 12 seconds, with a result of 182.400 records. The row count in the table is currently 420.930.407, and col1 and col3 are indexed.

The next query:

/* QUERY B */

WITH ALL_RECORDS AS
  (SELECT id
   FROM my_table
   WHERE col1 = value
     AND col2 = value
     AND col3 = value)
SELECT *
FROM ALL_RECORDS

This query took less than 2 seconds, and gives me all the id's of the 182.400 records in the table (according to the where clause).

Then, my last query, is a query that selects the lowest (first) id of all records grouped on the columns I want to group on to check for duplicates:

/* QUERY C */

SELECT MIN(id)
FROM my_table
WHERE col1 = value
  AND col2 = value
  AND col3 = value
GROUP BY col1,
         col2,
         col3,
         col4,
         col5,
         col6

Again, this query executes in less than 2 seconds. The result is 30.400, which means there are 30.400 unique records among the 182.400 records that are unique.

Now, I'd like to delete (or, first select to make sure I have my query right) all records that are not unique. So, I'd like to remove 182.400 - 30.400 = 152.000 records from my_table.

I thought I'd combine the two last queries: get all id's that belong to my dataset according to the where clause on col1, col2 and col3 (query B), and then delete/select all records from that dataset of which the id is not in the id list of the unique record id's (query C).

However, when I select all from query B where query B.id NOT IN query C, the query does not take 2, 4 or 12 (14 or 16) seconds, but seems to take forever (20.000 records shown after 1 minute, around 40.000 after 2 minutes, so I canceled the query since it'll find 152.000 records, which will take 8 minutes this way).

WITH ALL_RECORDS AS
  (SELECT id
   FROM my_table
   WHERE col1 = value
     AND col2 = value
     AND col3 = value)
SELECT id
FROM ALL_RECORDS
WHERE id NOT IN
    (SELECT MIN(id)
     FROM my_table
     WHERE col1 = value
       AND col2 = value
       AND col3 = value
     GROUP BY col1,
              col2,
              col3,
              col4,
              col5,
              col6)

I know NOT IN is slow, but I can't grasp how it's THIS slow (since both queries without the not in part execute in less than 2 seconds each).

Does anyone have some good advice for me on how to solve this puzzle?

------------------ Additional information ------------------

Previous solution was the following stored procedure. For some reason it executes perfectly on my acceptance environment, but not on my production environment. Currently, we have over 400 million records on production and a little over 2 million records on acceptance, so this might be a reason.

DELETE my_table
FROM my_table
LEFT OUTER JOIN
  (SELECT MIN(id) AS RowId,
          col1,
          col2,
          col3,
          col4,
          col5,
          col6
   FROM my_table
   WHERE col1 = value
     AND col2 = value
     AND col3 = value
   GROUP BY col1,
            col2,
            col3,
            col4,
            col5,
            col6) AS KeepRows ON my_table.id = KeepRows.RowId
WHERE KeepRows.RowId IS NULL
  AND my_table.col1 = value
  AND my_table.col2 = value
  AND my_table.col3 = value

I have based this solution on another answer on stackoverflow (can't find it at the moment), but I feel I should be able to create a query based on Query B and C that executes within a few seconds...

回答1:

with dupl as (
select row_number() over(partition by col1,col2,col3,col4,col5,col6 order by id) rn,
id,col1,col2,col3,col4,col5,col6
from myTable
)
delete dupl where rn>1


回答2:

Combining two 2-second queries together will not, generally, result in a single 4-second query, because queries, unlike their underlying tables, are rarely indexed.

Usual approach for this kind of tasks is to cache id's you want to keep in a temporary table, index it accordingly and then use it in the left join (or not in - I bet the resulting execution plans are practically the same).

You can probably get some more performance if you will play with indices on the main table. For example, I think that (col1, col2, col3) should give your code some boost (columns should not necessarily be mentioned in this order, it usually depends on their cardinalities).