Finding similar contact names within table

Posted 2019-09-07 17:21

I am performing data clean up and one of my tasks is to delete similar duplicate contacts.

EXAMPLE:

BILL CROSBIE, BILL CROSBY, BILL CROSSBY; or KRISTEN HARRIS, KRISTIN HARIS.

So, there is no exact rule, but by manually scanning this, I can tell that they are very similar and must be duplicates.

Can anyone provide an example of how I can do this using SSIS?

I understand that I can use the Fuzzy Lookup, but it requires a reference table of correct data to compare against the table that needs cleanup. Is it possible instead to use the Script Component in SSIS with an algorithm that finds the names with the most matching characters? What would that C# code look like?

I am new to SSIS and don't have much experience. Alternatively, is there some sort of script I can create in MSSQL that can do this?

Tags: sql sql-server ssis

1 answer

在下西门庆

#2 · 2019-09-07 18:07

I would use the SSIS Fuzzy Lookup component. I would use your Contacts table as the reference input, and store the new index (effectively creating an output table). I would configure the component's Advanced page to allow multiple matches and reduce the Similarity threshold.

After executing, I would query the new index table, examining the similarity and confidence scores. Scores above a certain threshold (which depends on your data) would indicate a duplicate.
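To make the "compare the table against itself, then apply a threshold" idea concrete, here is a minimal sketch outside SSIS. It is not what Fuzzy Lookup does internally, and it is Python rather than the C# asked about. It uses the standard library's `difflib.SequenceMatcher` as a stand-in similarity measure. The function names, the sample list, and the 0.8 threshold are all assumptions for illustration, and the threshold would need tuning against real data.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two names, ignoring case."""
    return SequenceMatcher(None, a.upper(), b.upper()).ratio()


def find_similar_pairs(names, threshold=0.8):
    """Compare every name with every other name; keep pairs at or above threshold.

    The threshold is an assumed starting point, not a recommended value.
    """
    pairs = []
    for a, b in combinations(names, 2):
        score = similarity(a, b)
        if score >= threshold:
            pairs.append((a, b, round(score, 3)))
    # Most similar pairs first, so the likeliest duplicates are reviewed first
    return sorted(pairs, key=lambda p: p[2], reverse=True)


# Sample names taken from the question
contacts = [
    "BILL CROSBIE", "BILL CROSBY", "BILL CROSSBY",
    "KRISTEN HARRIS", "KRISTIN HARIS",
]

for a, b, score in find_similar_pairs(contacts):
    print(f"{score:.3f}  {a}  <->  {b}")
```

Comparing every pair is quadratic in the number of contacts, so for a large table you would probably first group candidates (for example by first letter or by a phonetic key) and only compare within each group.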
