I am performing data clean up and one of my tasks is to delete similar duplicate contacts.
EXAMPLE:
BILL CROSBIE, BILL CROSBY, BILL CROSSBY; or KRISTEN HARRIS, KRISTIN HARIS.
So, there is no exact rule, but by manually scanning this, I can tell that they are very similar and must be duplicates.
Can anyone, provide an example of how I can do this using SSIS.
I understand that I can use the fuzzy lookup, but it requires a reference table or a reference data that is correct and would then compare to the table that needs data cleanup. However, is there a possibility that I can use the script component tool in SSIS to use an alogirthm that gets the characters with most matches. What would that C# code look like?
I am new to using SSIS and don't have much experience. Or is there some sort of script I can create in MSSQL that can do this?
I would use the SSIS Fuzzy Lookup component. I would use your Contacts table as the reference input, and store the new index (effectively creating an output table). I would configure the component's Advanced page to allow multiple matches and reduce the Similarity threshold.
After executing I would query the new index table, examining the similarity and confidence scores. Scores above a certain threshold (depends on your data) would indicate a duplicate.