The table I am dealing with has multiple rows which have the same values for lat
and lon
. The example shows that 1
, 3
, 5
have the same location but the name
attribute differs. The hash
is built from name
, lat
and lon
and differs therefore.
BEFORE:
id | name | lat | lon | flag | hash
----+------+-----+-----+------+------
1 | aaa | 16 | 48 | 0 | 2cd <-- duplicate
2 | bbb | 10 | 22 | 0 | 3fc
3 | ccc | 16 | 48 | 0 | 8ba <-- duplicate
4 | ddd | 10 | 23 | 0 | c33
5 | eee | 16 | 48 | 0 | 751 <-- duplicate
I need to identify "duplicates" within this table and want to assign the flag 1
(primary) to one of them and the flag 2
(secondary) to the others. It is not important which "duplicate" is flagged as primary.
AFTER:
id | name | lat | lon | flag | hash
----+------+-----+-----+------+------
1 | aaa | 16 | 48 | 1 | 2cd <-- updated
2 | bbb | 10 | 22 | 0 | 3fc
3 | ccc | 16 | 48 | 2 | 8ba <-- updated
4 | ddd | 10 | 23 | 0 | c33
5 | eee | 16 | 48 | 2 | 751 <-- updated
I started experimenting with INNER JOIN
inspired by this post and this visual description. With this I am able to assign the same flag to all duplicates.
UPDATE table t1
INNER JOIN table_name t2
ON
t1.lat = t2.lat
AND t1.lon = t2.lon
AND t1.hash != t2.hash
SET
t1.flag = 2;
I also tested LEFT OUTER JOIN
with WHERE t2.id IS NULL
which could work when there are only two rows. However, I cannot think my head off how a JOIN
should work with more then two duplicates. Mark Harrison also assumes "that you're joining on columns with no duplicates" at the beginning of his post which sound as if this is not a good idea.
I am using MySQL if this is of interest.
Assuming you want all canonical records to have a
flag=1
(and not just those records that have duplicates) you can do:You could do this in two steps:
This should be used as inspiration and not as gospel (it's probably wrong) :)
Not sure this is very efficient, but it works in just one query: