If I have a table with 2 important columns,
CREATE TABLE foo (id INT, a INT, b INT, KEY (a), KEY (b));
How can I find all the rows whose a and b are both the same as in some other row? For example, in this data set
id | a | b
----------
1 | 1 | 2
2 | 5 | 42
3 | 1 | 42
4 | 1 | 2
5 | 1 | 2
6 | 1 | 42
I want to get back all rows except for id=2, since it is unique in (a,b). Basically, I want to find all offending rows that would stop an
ALTER TABLE foo ADD UNIQUE (a, b);
Something better than an n^2 for loop would be nice since my table has 10M rows.
For bonus points: how do I remove all but one of the rows in each group? (I don't care which ones, as long as one is left.)
Or am I missing something?
===
Update for clarity:
++++++++++ After 3rd clarity edit:
But I'm shot, so check it yourself.
Should come up with all the rows where more than one row has the same combination of a and b.
Just hope you have an index on columns a and b.
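A minimal sketch of a query along those lines, assuming MySQL and the foo table from the question (the aliases f and dup are made up for illustration): group on (a, b), keep only the groups with more than one row, then join back to foo to get the full rows.

```sql
-- Return every row whose (a, b) pair occurs more than once in foo.
SELECT f.*
FROM foo AS f
JOIN (
    -- One row per duplicated (a, b) combination.
    SELECT a, b
    FROM foo
    GROUP BY a, b
    HAVING COUNT(*) > 1
) AS dup
  ON dup.a = f.a AND dup.b = f.b
ORDER BY f.a, f.b, f.id;
```

On the sample data this returns the rows with id 1, 3, 4, 5 and 6 and leaves out id 2. Rows where a or b is NULL never match the join, which lines up with how a MySQL UNIQUE index treats NULLs.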
shouldn't this work?
=== edit ===
then how about
=== final re-edit before i give up on this question ===
Try this:
This query should show duplicate rows in the table foo.
Your stated goal is to remove all duplicate combinations of (a,b). For that, you can use a multi-table DELETE.
Before you run it, you can check which rows will be removed with:
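The statements themselves are not shown here, so the following is only a sketch of what they might look like, assuming MySQL, a unique non-NULL id, and the aliases t1 and t2 with the condition t2.id > t1.id. The preview SELECT comes first, followed by the matching multi-table DELETE.

```sql
-- Preview: rows that have a "twin" with the same (a, b) and a higher id.
SELECT DISTINCT t1.*
FROM foo AS t1
JOIN foo AS t2 USING (a, b)
WHERE t2.id > t1.id;

-- Delete exactly those rows; only t1 is named after DELETE,
-- so nothing is removed through the t2 side of the join.
DELETE t1
FROM foo AS t1
JOIN foo AS t2 USING (a, b)
WHERE t2.id > t1.id;
```

On the sample data the preview lists the rows with id 1, 3 and 4.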
Because the WHERE clause is t2.id > t1.id, it will remove all but the row with the highest value for id in each (a,b) group. In your case, only the rows with id equal to 2, 5 or 6 would remain.
Here's another approach:
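The query for this approach is not shown, so what follows is one subquery-based possibility rather than the original. It assumes MySQL and a unique non-NULL id; the names keep_id and keepers are hypothetical. The idea is to pick one id to keep per (a, b) group and delete everything else.

```sql
-- Keep the highest id of each (a, b) group and delete the rest.
DELETE FROM foo
WHERE id NOT IN (
    -- The extra derived table works around MySQL's restriction on
    -- selecting from the table being modified (error 1093).
    SELECT keep_id
    FROM (
        SELECT MAX(id) AS keep_id
        FROM foo
        GROUP BY a, b
    ) AS keepers
);
```

Swapping MAX for MIN keeps the lowest id instead. Prefixing the statement with EXPLAIN (available for DELETE since MySQL 5.6) shows the execution plan without running it.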
Anyway, even though I find it a bit more readable, with such a huge table you should check the execution plan; subqueries have a bad reputation when it comes to performance.
You should also consider creating the index (without the unique clause, obviously) to speed up the query. For huge operations, it is sometimes better to spend the time creating the index, perform the update and then drop the index. In this case, I guess an index on (a, b) should help a lot.
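As a rough sketch of that sequence, assuming MySQL (the index name idx_foo_a_b is made up for illustration):

```sql
-- Temporary non-unique composite index to speed up the duplicate search.
CREATE INDEX idx_foo_a_b ON foo (a, b);

-- ... find and delete the duplicate rows here ...

-- Once no duplicates remain, drop the helper index
-- and add the unique constraint the question is after.
ALTER TABLE foo DROP INDEX idx_foo_a_b;
ALTER TABLE foo ADD UNIQUE (a, b);
```

Building the index on 10M rows takes time of its own, so it is worth comparing the plans with and without it before committing to this route.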