Select multiple field duplicates from MySQL Databa

2019-03-01 09:52发布

I've got an old forum which contains threads with duplicate first posts (perhaps differing replies). I want to delete all but one of these threads (leaving the thread with the highest view count).

I have the following SQL query to help identify duplicate threads, but I can't find a way for it to list only duplicates with the lowest value for the xf_thread.view_count column:

SELECT
    t.thread_id, MIN(t.view_count)
FROM
    xf_thread t
INNER JOIN
    xf_post p ON p.thread_id = t.thread_id
WHERE
    t.first_post_id = p.post_id
GROUP BY
    t.title,
    t.username,
    p.message
HAVING
    COUNT(t.title) > 1
    AND COUNT(t.username) > 1
    AND COUNT(p.message) > 1;

At the moment, this query correctly groups threads, but it only shows a random thread_id - rather than the thread_id corresponding to min(view_count).

I've read up on how to work around this, but I can't figure out how to achieve this - as it doesn't look like it's possible to order the row in group by queries.

edit

Thanks to the help from Madhur, the query now returns all the of the thread IDs to be deleted. However, I can figure out how to delete the rows with matching thread_id's. Here's the query I tried to use (it just keeps running, whereas the select query (https://stackoverflow.com/a/52314208/2469308) runs in seconds:

DELETE FROM xf_thread 
WHERE  thread_id IN (SELECT Substring_index(Group_concat(DISTINCT t.thread_id 
                                            ORDER BY 
                                            t.view_count 
                                                                ASC 
                                            SEPARATOR ','), 
                                   ',', 1) AS 
                            thread_id_with_minimum_views 
                     FROM   (SELECT * 
                             FROM   xf_thread) t 
                            INNER JOIN xf_post p 
                                    ON p.thread_id = t.thread_id 
                     WHERE  t.first_post_id = p.post_id 
                            AND t.user_id = 0 
                            AND t.reply_count < 2 
                     GROUP  BY t.title, 
                               t.username, 
                               p.message 
                     HAVING Count(t.title) > 1 
                            AND Count(t.username) > 1 
                            AND Count(p.message) > 1 
                     ORDER  BY t.thread_id); 

1条回答
看我几分像从前
2楼-- · 2019-03-01 10:20

A very hacky solution is to get the thread_id sorted by view_count in a GROUP_CONCAT. Then, we can utilize string operations to get the thread_id with minimum view_count.

In your SELECT clause, instead of t.thread_id, you can try the following:

SUBSTRING_INDEX(GROUP_CONCAT(DISTINCT t.thread_id 
                             ORDER BY t.view_count ASC 
                             SEPARATOR ','), 
                ',', 
                1) AS thread_id_with_minimum_views

Now, based on the SELECT query to identify Duplicate records with Minimum view, DELETE query to delete such records from the xf_thread table will be as follows:

DELETE t_delete FROM xf_thread AS t_delete 
INNER JOIN (SELECT CAST(SUBSTRING_INDEX(GROUP_CONCAT(DISTINCT t.thread_id ORDER BY t.view_count ASC SEPARATOR ','), ',', 1) AS UNSIGNED) AS tid_min_view 
            FROM (SELECT * FROM xf_thread) t 
            INNER JOIN xf_post p ON p.thread_id = t.thread_id 
            WHERE t.first_post_id = p.post_id 
              AND t.user_id = 0 
              AND t.reply_count < 2 
            GROUP BY t.title, t.username, p.message 
            HAVING Count(t.title) > 1 
               AND Count(t.username) > 1 
               AND Count(p.message) > 1 
            ORDER BY t.thread_id) AS t_dup 
  ON t_delete.thread_id = t_dup.tid_min_view 
查看更多
登录 后发表回答