I have two sets of data: existing customers and potential customers.
My main objective is to figure out if any of the potential customers are already existing customers. However, the naming conventions of customers across data sets are inconsistent.
EXISTING CUSTOMERS
Customer / ID
Ed's Barbershop / 1002
GroceryTown / 1003
Candy Place / 1004
Handy Man / 1005
POTENTIAL CUSTOMERS
Customer
Eds Barbershop
Grocery Town
Candy Place
Handee Man
Beauty Salon
The Apple Farm
Igloo Ice Cream
Ride-a-Long Bikes
I would like to write some type of select statement like below to reach my objective:
SELECT a.Customer, b.ID
FROM PotentialCustomers a
LEFT JOIN ExistingCustomers b
ON a.Customer = b.Customer
The results would look something like:
Customer / ID
Eds Barbershop / 1002
Grocery Town / 1003
Candy Place / 1004
Handee Man / 1005
Beauty Salon / NULL
The Apple Farm / NULL
Igloo Ice Cream / NULL
Ride-a-Long Bikes / NULL
I am vaguely familiar with the concepts of Levenshtein Distance and Double Metaphone, but I am not sure how to apply them here.
Ideally I would want the JOIN portion of the SELECT statement to read something like: LEFT JOIN ExistingCustomers as B WHERE a.Customer LIKE b.Customer
but I know that syntax is incorrect.
Any suggestions are welcomed. Thank you!
Trying to do this within SQL is going to be a continual challenge, and one that you are not likely to win. You can get quite far by stripping out every character that is not a-z or 0-9, or by trying something like Soundex or Metaphone matching or Levenshtein Distance, but there will always be another edge case that you didn't pick up in all your replacing, wildcarding, phoneticising or plain fudging.
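To make the Soundex idea concrete, here is a minimal sketch rather than a recommended solution. It assumes the question's PotentialCustomers and ExistingCustomers tables exist as shown, and it strips spaces and apostrophes first, since how SOUNDEX treats non-letter characters can vary with the SQL Server version and compatibility level.

```sql
-- Sketch only: join on the Soundex code of a lightly cleaned name.
-- Assumes the PotentialCustomers and ExistingCustomers tables from the question.
SELECT a.Customer, b.ID
FROM PotentialCustomers AS a
LEFT JOIN ExistingCustomers AS b
    ON SOUNDEX(REPLACE(REPLACE(a.Customer, ' ', ''), '''', ''))
     = SOUNDEX(REPLACE(REPLACE(b.Customer, ' ', ''), '''', ''));
```

A Soundex code is only four characters long, so unrelated names that start with the same letter and similar consonants will collide. Treat the output as a list of candidates to review, not as confirmed matches.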
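If you do manage to find something that works with enough accuracy for you, you will then hit performance problems: a join on a function of the name columns typically cannot use an ordinary index, so every potential customer ends up being compared against every existing customer.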
In short, your best options are either to go way down the SQLCLR route, learning a lot of C# on the way, or to not bother with fuzzy matching at all and instead clean your data at source or create a lookup table of 'clean' names, which will require constant maintenance as new variants come in.
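The lookup-table option could look roughly like the sketch below. The #CustomerAlias table and its columns are hypothetical names chosen for illustration, and the query assumes the question's PotentialCustomers and ExistingCustomers tables exist.

```sql
-- Hypothetical lookup table: one row per known spelling variant.
CREATE TABLE #CustomerAlias (
    Alias varchar(100) NOT NULL PRIMARY KEY,
    ID    int          NOT NULL   -- ID of the matching existing customer
);

-- Variants are added by hand as they are discovered.
INSERT INTO #CustomerAlias (Alias, ID)
VALUES ('Eds Barbershop', 1002),
       ('Grocery Town',   1003),
       ('Handee Man',     1005);

-- Try an exact match first, then fall back to the alias table.
SELECT a.Customer, COALESCE(b.ID, ca.ID) AS ID
FROM PotentialCustomers AS a
LEFT JOIN ExistingCustomers AS b  ON b.Customer = a.Customer
LEFT JOIN #CustomerAlias    AS ca ON ca.Alias   = a.Customer;
```

Both joins are plain equality joins, so they can use indexes. The cost moves to keeping the alias table up to date.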
You need more than one field to accomplish this with any effectiveness. Do you have things like city, state, zip, address, etc.? You can then create a multipart key by concatenating those fields. You may want to truncate some of them to, say, the first 5 characters, but the looser you make the key, the more false positives you get.
I’ve done this by creating a couple of keys, each less restrictive than the last, then trying each key in turn and assigning a match grade according to which key produced the match.
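A rough sketch of the graded-key idea follows. The question's data has no address fields, so the Zip column and its values are made-up placeholders, and the table variables, key layout and 4-character prefix are all assumptions for illustration.

```sql
-- Zip values are made-up placeholders; the question's data has no such column.
DECLARE @Existing  TABLE (ID int, Customer varchar(100), Zip char(5));
DECLARE @Potential TABLE (Customer varchar(100), Zip char(5));

INSERT INTO @Existing  VALUES (1002, 'Ed''s Barbershop', '00001'), (1005, 'Handy Man', '00002');
INSERT INTO @Potential VALUES ('Eds Barbershop', '00001'), ('Handee Man', '00002');

WITH e AS (
    SELECT x.ID,
           x.Zip + '|' + c.Clean          AS Key1,   -- strict: zip + full cleaned name
           x.Zip + '|' + LEFT(c.Clean, 4) AS Key2    -- loose: zip + 4-character prefix
    FROM @Existing AS x
    CROSS APPLY (SELECT UPPER(REPLACE(REPLACE(x.Customer, ' ', ''), '''', '')) AS Clean) AS c
),
p AS (
    SELECT x.Customer,
           x.Zip + '|' + c.Clean          AS Key1,
           x.Zip + '|' + LEFT(c.Clean, 4) AS Key2
    FROM @Potential AS x
    CROSS APPLY (SELECT UPPER(REPLACE(REPLACE(x.Customer, ' ', ''), '''', '')) AS Clean) AS c
)
SELECT p.Customer,
       COALESCE(e1.ID, e2.ID) AS ID,
       CASE WHEN e1.ID IS NOT NULL THEN 1            -- matched on the strict key
            WHEN e2.ID IS NOT NULL THEN 2 END AS MatchGrade
FROM p
LEFT JOIN e AS e1 ON e1.Key1 = p.Key1
LEFT JOIN e AS e2 ON e1.ID IS NULL AND e2.Key2 = p.Key2;  -- loose key only if strict failed
```

With these placeholder rows, 'Eds Barbershop' should match on the strict key (grade 1) and 'Handee Man' only on the loose key (grade 2). A loose key can match several existing customers at once, so grade 2 rows are worth reviewing by hand.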
One way is to apply the REPLACE function to the columns on both sides of the comparison.
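For example, a minimal sketch that removes spaces and apostrophes from both names before comparing them. Which characters to strip is an assumption based on the sample data in the question.

```sql
-- Compare names with spaces and apostrophes removed on both sides.
-- Assumes the PotentialCustomers and ExistingCustomers tables from the question.
SELECT a.Customer, b.ID
FROM PotentialCustomers AS a
LEFT JOIN ExistingCustomers AS b
    ON REPLACE(REPLACE(a.Customer, ' ', ''), '''', '')
     = REPLACE(REPLACE(b.Customer, ' ', ''), '''', '');
```

On the sample data this pairs up 'Eds Barbershop', 'Grocery Town' and 'Candy Place', but not 'Handee Man', because 'HandeeMan' and 'HandyMan' still differ after cleaning. Misspellings need a phonetic or edit-distance comparison instead.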
Here is how this could be done using Levenshtein Distance:
Create this function first (execute it before the query):
(Function developed by Joseph Gama)
Then simply use this query to get matches:
Complete script, to run after you create that function:
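As a rough stand-in for the function and query referred to above, here is a minimal sketch. It is not Joseph Gama's code: the function name dbo.LevenshteinSketch, the 200-character limit and the distance threshold of 2 are all assumptions, and the query assumes the question's PotentialCustomers and ExistingCustomers tables exist.

```sql
CREATE FUNCTION dbo.LevenshteinSketch (@s nvarchar(200), @t nvarchar(200))
RETURNS int
AS
BEGIN
    -- Note: LEN ignores trailing spaces, and character equality follows the collation.
    DECLARE @m int = LEN(@s), @n int = LEN(@t);
    IF @m = 0 RETURN @n;
    IF @n = 0 RETURN @m;

    -- Each row of the distance matrix is kept as a string, one character per cell.
    -- A cell value v is stored as NCHAR(v + 1000) to stay clear of control characters.
    DECLARE @prev nvarchar(201) = N'', @curr nvarchar(201);
    DECLARE @i int = 1, @j int = 0, @sub int, @del int, @ins int, @best int;

    -- Row 0: distance from the empty prefix of @s to each prefix of @t.
    WHILE @j <= @n
    BEGIN
        SET @prev = @prev + NCHAR(@j + 1000);
        SET @j = @j + 1;
    END;

    WHILE @i <= @m
    BEGIN
        SET @curr = NCHAR(@i + 1000);   -- column 0: delete the first @i characters of @s
        SET @j = 1;
        WHILE @j <= @n
        BEGIN
            SET @sub = UNICODE(SUBSTRING(@prev, @j, 1)) - 1000
                     + CASE WHEN SUBSTRING(@s, @i, 1) = SUBSTRING(@t, @j, 1) THEN 0 ELSE 1 END;
            SET @del = UNICODE(SUBSTRING(@prev, @j + 1, 1)) - 1000 + 1;
            SET @ins = UNICODE(SUBSTRING(@curr, @j, 1)) - 1000 + 1;
            SET @best = CASE WHEN @sub <= @del AND @sub <= @ins THEN @sub
                             WHEN @del <= @ins THEN @del
                             ELSE @ins END;
            SET @curr = @curr + NCHAR(@best + 1000);
            SET @j = @j + 1;
        END;
        SET @prev = @curr;
        SET @i = @i + 1;
    END;

    RETURN UNICODE(SUBSTRING(@prev, @n + 1, 1)) - 1000;
END;
GO

-- For each potential customer, keep the closest existing customer within 2 edits.
SELECT p.Customer, m.ID
FROM PotentialCustomers AS p
OUTER APPLY (
    SELECT TOP (1) e.ID
    FROM ExistingCustomers AS e
    WHERE dbo.LevenshteinSketch(p.Customer, e.Customer) <= 2
    ORDER BY dbo.LevenshteinSketch(p.Customer, e.Customer)
) AS m;
```

On the question's sample data a threshold of 2 should be enough to pair up the four intended customers ('Handee Man' is two edits away from 'Handy Man') while leaving the rest NULL. On real data the threshold needs tuning, and the function runs once for every pair of rows, so expect it to be slow on large tables.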
You can find a T-SQL example at http://www.kodyaz.com/articles/fuzzy-string-matching-using-levenshtein-distance-sql-server.aspx