Finding similar contact names within table

2019-09-07 17:29发布

问题:

I am performing data clean up and one of my tasks is to delete similar duplicate contacts.

EXAMPLE:

BILL CROSBIE, BILL CROSBY, BILL CROSSBY; or KRISTEN HARRIS, KRISTIN HARIS. 

So, there is no exact rule, but by manually scanning this, I can tell that they are very similar and must be duplicates.

Can anyone, provide an example of how I can do this using SSIS.

I understand that I can use the fuzzy lookup, but it requires a reference table or a reference data that is correct and would then compare to the table that needs data cleanup. However, is there a possibility that I can use the script component tool in SSIS to use an alogirthm that gets the characters with most matches. What would that C# code look like?

I am new to using SSIS and don't have much experience. Or is there some sort of script I can create in MSSQL that can do this?

回答1:

I would use the SSIS Fuzzy Lookup component. I would use your Contacts table as the reference input, and store the new index (effectively creating an output table). I would configure the component's Advanced page to allow multiple matches and reduce the Similarity threshold.

After executing I would query the new index table, examining the similarity and confidence scores. Scores above a certain threshold (depends on your data) would indicate a duplicate.