Pandas fuzzy detect duplicates

Posted 2020-06-23 08:14

Question:

How can I use fuzzy matching in pandas to detect duplicate rows efficiently?

How can I find duplicates of one column vs. all the other ones without a gigantic for loop that converts row_i with toString() and then compares it to all the other ones?

Answer 1:

This is not pandas-specific, but within the Python ecosystem the dedupe library seems to do what you want. In particular, it lets you compare each column of a row separately and then combine that information into a single probability score of a match.
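
To make the idea of "compare each column separately, then combine into one score" concrete, here is a minimal sketch that uses only the standard library's difflib. It is not the dedupe API and does not reproduce how dedupe learns its probabilities. The helper names, the column weights, and the 0.85 threshold are all assumptions made up for illustration. Rows are plain dicts, which is the shape `df.to_dict("records")` would give you from a DataFrame.

```python
from difflib import SequenceMatcher
from itertools import combinations


def field_similarity(a, b):
    """Similarity in [0, 1] between two values, compared as lowercase strings."""
    return SequenceMatcher(None, str(a).lower(), str(b).lower()).ratio()


def row_score(row_a, row_b, weights):
    """Weighted average of the per-column similarities.

    `weights` maps column name -> weight; the values are hypothetical
    and hand-picked here, not learned from data.
    """
    total = sum(weights.values())
    return sum(
        w * field_similarity(row_a[col], row_b[col]) for col, w in weights.items()
    ) / total


def find_fuzzy_duplicates(rows, weights, threshold=0.85):
    """Return (i, j, score) for every pair of rows scoring at or above threshold."""
    matches = []
    for (i, a), (j, b) in combinations(enumerate(rows), 2):
        score = row_score(a, b, weights)
        if score >= threshold:
            matches.append((i, j, round(score, 3)))
    return matches


# Made-up sample data: rows 0 and 1 differ by a single letter in the name.
rows = [
    {"name": "John Smith", "city": "New York"},
    {"name": "Jon Smith", "city": "New York"},
    {"name": "Alice Brown", "city": "Boston"},
]
weights = {"name": 2.0, "city": 1.0}  # assumed weights for illustration

duplicates = find_fuzzy_duplicates(rows, weights)
print(duplicates)
```

Note that this sketch still compares every pair of rows, so it is only practical for small tables; it shows the scoring idea, not a way to avoid the quadratic loop.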



Answer 2:

There is now a package that makes it easier to use the dedupe library with pandas: pandas-dedupe.

(I am a developer of the original dedupe library, but not the pandas-dedupe package)