What is an example of a fast SQL query to find duplicates in datasets with hundreds of thousands of records? I typically use something like:
SELECT afield1, afield2 FROM afile a
WHERE 1 < (SELECT count(afield1) FROM afile b WHERE a.afield1 = b.afield1);
But this is quite slow.
You could try something along these lines:
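A sketch using a GROUP BY / HAVING subquery (table and column names taken from the question): one pass groups the keys, and an IN filter pulls back every row whose key occurs more than once.

SELECT afield1, afield2
FROM afile
WHERE afield1 IN (
    -- single pass over the table to find keys that occur more than once
    SELECT afield1
    FROM afile
    GROUP BY afield1
    HAVING COUNT(*) > 1
);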
By the way, if anyone wants to remove the duplicates, I have used this:
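One query in that spirit, sketched here assuming the table has a unique id column (that column name is an assumption; substitute your primary key):

-- keep the row with the smallest id in each duplicate group, delete the rest
DELETE FROM afile
WHERE id NOT IN (
    SELECT MIN(id)
    FROM afile
    GROUP BY afield1
);

Note that some dialects (MySQL among them) won't let a DELETE select from its own target table, so the subquery may need to be wrapped in a derived table.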
This is the more direct way:
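Presumably a plain aggregate along these lines (a sketch; dup_count is an illustrative alias), which as a bonus also reports how many times each key occurs:

SELECT afield1, COUNT(afield1) AS dup_count
FROM afile
GROUP BY afield1
-- keep only keys that appear more than once
HAVING COUNT(afield1) > 1;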
A similar question was asked last week. There are some good answers there.
SQL to find duplicate entries (within a group)
In that question, the OP was interested in all the columns (fields) in the table (file), but rows belonged in the same group if they had the same key value (afield1).
There are three kinds of answers:
- subqueries in the WHERE clause, like some of the other answers here;
- an inner join between the table and the groups viewed as a table (my answer);
- analytic queries (something that's new to me; see the sketch below).
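For the analytic flavor, here is a sketch using a window function (standard SQL, reusing the question's names; it needs a database with window-function support):

SELECT afield1, afield2
FROM (
    -- count the rows sharing each key without collapsing them
    SELECT afield1,
           afield2,
           COUNT(*) OVER (PARTITION BY afield1) AS key_count
    FROM afile
) t
WHERE key_count > 1;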
Something like the existence check sketched below should be reasonably fast (even faster if the duplicated fields are indexed).
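A sketch of such a query, assuming a unique id column to tell rows apart (that column name is an assumption):

-- a row is a duplicate if some other row carries the same key
SELECT afield1, afield2
FROM afile a
WHERE EXISTS (
    SELECT 1
    FROM afile b
    WHERE b.afield1 = a.afield1
      AND b.id <> a.id
);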
I guess the only downside to this query is that, because you're not doing a COUNT(*), you can't check for the number of times a row is duplicated, only that it appears more than once.