My original question with all the relevant context can be found here:
Adding a multi-column primary key to a table with 40 million records
I have a table with 40 million rows and no primary key. Before I add the primary key, I would like to check if the table has any duplicate entries. When I say duplicate entries, I don't just mean duplicate on particular columns. I mean duplicates on entire rows.
I was told in my last question that I can do an EXISTS
query to determine duplicates. How would I do that?
I am running PostgreSQL 8.1.22. (Got this info by running select version()
).
shouldn't something like that do the job?
not sure if its the most efficient way, but count>1 means you have two identical rows.
To find whether any full duplicate exists (identical on all columns), this is probably the fastest way:
NATURAL JOIN
is a very convenient shorthand for the case because (quoting the manual here):EXISTS
is probably fastest, because Postgres stops searching as soon as the first duplicate is found. Since you most probably don't have an index covering the whole row and your table is huge, this will save you a lot of time.Be aware that
NULL
is never considered identical to anotherNULL
. If you haveNULL
values and consider them identical, you'd have to do more.ctid
is a system column that can be (ab-)used as ad-hoc primary key, but cannot replace an actual user-defined primary key in the long run.The outdated version 8.1 seems to have no
<>
operator defined for actid
. Try casting totext
: