I am trying to find a reliable method for matching duplicate person records within the database. The data has some serious quality issues, which I am also trying to fix, but until I have the go-ahead to do so I am stuck with the data I have.
The table columns available to me are:
SURNAME VARCHAR2(43)
FORENAME VARCHAR2(38)
BIRTH_DATE DATE
ADDRESS_LINE1 VARCHAR2(60)
ADDRESS_LINE2 VARCHAR2(60)
ADDRESS_LINE3 VARCHAR2(60)
ADDRESS_LINE4 VARCHAR2(60)
ADDRESS_LINE5 VARCHAR2(60)
POSTCODE VARCHAR2(15)
The SOUNDEX function is relatively limited for this use, but the UTL_MATCH package seems to offer a better level of matching using the Jaro Winkler algorithm.
Rather than re-inventing the wheel, has anyone implemented a reliable method for matching this type of data?
Data Quality issues to contend with:
- The postcode, though mandatory, isn't always fully entered.
- The address data is of relatively poor quality, with addresses entered in no fixed format (e.g. some may have line1 as "Flat 1" whereas some may have line1 as "Flat1, 22 Acacia Ave").
- The forename column can contain an initial, a full forename or sometimes more than one forename.
For example I was considering:
Concatenating all address fields and applying the Jaro Winkler algorithm to the full address combined with a similar test of the full name concatenated together.
The birth date can be compared directly for a match, but due to the large volume of data, matching on this alone isn't enough.
Oracle 10g R2 Enterprise Edition.
Any helpful suggestions welcome.
Alas, there is no such thing as a reliable method. The most you can hope for is a system with a reasonable element of doubt.
The big advantage of SOUNDEX is that it tokenizes the string. This means it gives you something which can be indexed: this is incredibly valuable when it comes to large amounts of data. On the other hand it is old and crude. There are newer algorithms around, such as Metaphone and Double Metaphone. You should be able to find PL/SQL implementations of them via Google.
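To make the "something which can be indexed" point concrete, here is a minimal sketch in Python rather than PL/SQL. It implements the standard American Soundex rules and uses the code as a lookup key, which is the same role an index on the Soundex value would play in the database. The record layout and the function name block_by_surname are made up for illustration.

```python
from collections import defaultdict

# Standard American Soundex digit classes.
_CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        _CODES[ch] = digit


def soundex(name):
    """Return the 4-character Soundex code of a name ('' if it has no letters)."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    out = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters that share a code
        code = _CODES.get(ch, "")
        if code and code != prev:
            out += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (out + "000")[:4]


def block_by_surname(records):
    """Group records by the Soundex code of the surname (hypothetical helper)."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[soundex(rec["surname"])].append(rec)
    return blocks


# Illustrative data only.
people = [
    {"id": 1, "surname": "SMITH"},
    {"id": 2, "surname": "SMYTHE"},
    {"id": 3, "surname": "ROBERT"},
    {"id": 4, "surname": "RUPERT"},
    {"id": 5, "surname": "GRAHAM"},
]
blocks = block_by_surname(people)
print(sorted(blocks))
```

Each name maps to one fixed value, so finding everyone who "sounds like" SMITH is a single key lookup instead of a comparison against every row. The crudeness shows too: ROBERT and RUPERT land in the same block.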
The advantage of scoring algorithms is that they allow for a degree of fuzziness, so you can find all rows where name_score >= 90%. The crushing disadvantage is that the scores are relative (a score belongs to a pair of strings, not to a single row) and so you cannot index them. This sort of comparison kills you with large volumes. What this means is:
In my experience concatenating the tokens (first name, last name) is a mixed blessing. It solves certain problems (such as whether the road name appears in address line 1 or address line 2) but causes other problems: consider scoring GRAHAM OLIVER vs OLIVER GRAHAM against scoring OLIVER vs OLIVER, GRAHAM vs GRAHAM, OLIVER vs GRAHAM and GRAHAM vs OLIVER.
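The GRAHAM OLIVER example can be tried out with a small sketch. The code below is a plain-Python implementation of the textbook Jaro Winkler formula (prefix weight 0.1, prefix capped at 4 characters); it is not Oracle's UTL_MATCH and its scores may not agree with that package exactly. The token_score helper is a made-up, deliberately naive scheme that scores each token against its best counterpart and ignores word order.

```python
def jaro(s1, s2):
    """Jaro similarity in the range 0.0 to 1.0."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Walk the matched characters in order and count those out of sequence.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, p=0.1):
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


def token_score(name1, name2):
    """Average, over the shorter name's tokens, of the best score in the other name."""
    t1, t2 = name1.split(), name2.split()
    if not t1 or not t2:
        return 0.0
    if len(t1) > len(t2):
        t1, t2 = t2, t1
    return sum(max(jaro_winkler(a, b) for b in t2) for a in t1) / len(t1)


concatenated = jaro_winkler("GRAHAM OLIVER", "OLIVER GRAHAM")
token_wise = token_score("GRAHAM OLIVER", "OLIVER GRAHAM")
print(round(concatenated, 3), round(token_wise, 3))
```

The token-wise score comes out as a perfect 1.0 while the concatenated score does not, which cuts both ways: swapped fields are caught, but a person whose forename is Graham and surname is Oliver becomes indistinguishable from one named the other way round.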
Whatever you do, you will still end up with false positives and missed hits. No algorithm is proof against typos (although Jaro Winkler did pretty well with MARX vs AMRX).
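One common way to live with both constraints is to narrow the candidates with something exact and indexable first, then apply fuzzy scoring only inside each small group. The sketch below assumes birth date as the exact key, since it can be compared directly, and uses Python's difflib.SequenceMatcher as a stand-in scorer (it is not Jaro Winkler). The record layout, the 0.85 threshold and the function names are assumptions for illustration.

```python
from collections import defaultdict
from datetime import date
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Stand-in fuzzy score between 0.0 and 1.0 (difflib ratio, not Jaro Winkler)."""
    return SequenceMatcher(None, a.upper(), b.upper()).ratio()


def candidate_pairs(records, threshold=0.85):
    """Group by birth date, then score surnames only within each group."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec["birth_date"]].append(rec)
    pairs = []
    for group in blocks.values():
        for a, b in combinations(group, 2):
            score = similarity(a["surname"], b["surname"])
            if score >= threshold:
                pairs.append((a["id"], b["id"], round(score, 2)))
    return pairs


# Illustrative data only.
people = [
    {"id": 1, "surname": "JOHNSON", "birth_date": date(1970, 3, 14)},
    {"id": 2, "surname": "JOHNSTON", "birth_date": date(1970, 3, 14)},
    {"id": 3, "surname": "JOHNSON", "birth_date": date(1982, 11, 2)},
    {"id": 4, "surname": "PATEL", "birth_date": date(1970, 3, 14)},
]
print(candidate_pairs(people))
```

Only records 1 and 2 come back as a candidate pair. Record 3 is never compared with record 1 because the birth dates differ, so if that date was mistyped the duplicate is missed: the narrowing step buys speed at the cost of exactly the missed hits described above.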