Check for complete duplicate rows in a large table

My original question with all the relevant context can be found here:
Adding a multi-column primary key to a table with 40 million records

I have a table with 40 million rows and no primary key. Before I add the primary key, I would like to check if the table has any duplicate entries. When I say duplicate entries, I don't just mean duplicate on particular columns. I mean duplicates on entire rows.

I was told in my last question that I can do an EXISTS query to determine duplicates. How would I do that?

I am running PostgreSQL 8.1.22. (Got this info by running select version()).

标签： sql postgresql exists duplicates postgresql-8.1

2条回答

兄弟一词,经得起流年.

2楼-- · 2019-05-19 01:00

shouldn't something like that do the job?

SELECT ALL_COLUMNS[expect unique ID],
       count(0) as Dupl 
FROM   table 
WHERE  Dupl>1
GROUP BY ALL_COLUMNS[expect unique ID];

not sure if its the most efficient way, but count>1 means you have two identical rows.

0人赞添加讨论(0) 举报

倾城　Initia

3楼-- · 2019-05-19 01:03

To find whether any full duplicate exists (identical on all columns), this is probably the fastest way:

SELECT EXISTS (
    SELECT 1
    FROM   tbl t
    NATURAL JOIN tbl t1 
    WHERE  t.ctid <> t1.ctid
    )

NATURAL JOIN is a very convenient shorthand for the case because (quoting the manual here):

NATURAL is shorthand for a USING list that mentions all columns in the two tables that have the same names.

EXISTS is probably fastest, because Postgres stops searching as soon as the first duplicate is found. Since you most probably don't have an index covering the whole row and your table is huge, this will save you a lot of time.

Be aware that NULL is never considered identical to another NULL. If you have NULL values and consider them identical, you'd have to do more.

ctid is a system column that can be (ab-)used as ad-hoc primary key, but cannot replace an actual user-defined primary key in the long run.

The outdated version 8.1 seems to have no <> operator defined for a ctid. Try casting to text:

SELECT EXISTS (
    SELECT 1
    FROM   tbl t
    NATURAL JOIN tbl t1 
    WHERE  t.ctid::text <> t1.ctid::text
    )

0人赞添加讨论(0) 举报

Check for complete duplicate rows in a large table

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间