Pandas fuzzy detect duplicates

Posted 2020-06-23 08:14


够拽才男人



Question:

How can I use fuzzy matching in pandas to detect duplicate rows efficiently?

How do I find duplicates of one row against all the others without a gigantic for loop that converts row_i to a string and then compares it to every other row?

Answer 1:

This is not pandas-specific, but within the Python ecosystem the dedupe library seems to do what you want. In particular, it lets you compare each column of a row separately and then combine the results into a single probability score for a match.
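To illustrate the idea of scoring each column separately and then combining the scores, here is a minimal sketch that uses only the standard library (difflib) rather than dedupe itself. The function names, the column weights, and the 0.85 threshold are assumptions made up for this example, and the combined score is a plain weighted average, not the learned probability that dedupe produces. It also compares every pair of rows, so it is only practical for small tables.

```python
from difflib import SequenceMatcher
from itertools import combinations


def field_similarity(a, b):
    """Similarity in [0, 1] between two values, compared as lowercase strings."""
    return SequenceMatcher(None, str(a).lower(), str(b).lower()).ratio()


def fuzzy_duplicate_pairs(records, weights, threshold=0.85):
    """Return (i, j, score) for every pair of records scoring at least threshold.

    The score is a weighted average of per-column similarities. It is a
    heuristic stand-in for illustration, not a calibrated match probability.
    """
    total = sum(weights.values())
    pairs = []
    # Compare every unordered pair of rows once (quadratic in the row count).
    for (i, left), (j, right) in combinations(enumerate(records), 2):
        score = sum(
            w * field_similarity(left[col], right[col])
            for col, w in weights.items()
        ) / total
        if score >= threshold:
            pairs.append((i, j, round(score, 3)))
    return pairs


# Hypothetical sample data: rows 0 and 1 differ by a single typo in "name".
records = [
    {"name": "John Smith", "city": "New York"},
    {"name": "Jon Smith", "city": "New York"},
    {"name": "Alice Jones", "city": "Boston"},
]
pairs = fuzzy_duplicate_pairs(records, weights={"name": 2.0, "city": 1.0})
print(pairs)
```

To run this on a DataFrame, pass `df.to_dict("records")` as `records`; the indices in the result are then positional row numbers.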