How can I use fuzzy matching in pandas to detect duplicate rows (efficiently)?
How can I find duplicates of one column vs. all the other ones without a gigantic for loop that converts row_i to a string and then compares it to every other row?
Not pandas-specific, but within the Python ecosystem the dedupe library would seem to do what you want. In particular, it lets you compare each column of a row separately and then combine that information into a single probability that two rows are a match.
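As a rough sketch (assuming the dedupe 2.x API; the column names, sample records, and the 0.5 threshold are placeholders, not anything from your data):

```python
import dedupe

# dedupe expects a dict of {record_id: {field: value}}.
data = {
    0: {"name": "John Smith", "address": "123 Main St"},
    1: {"name": "Jon Smith",  "address": "123 Main Street"},
    2: {"name": "Jane Doe",   "address": "456 Oak Ave"},
}

# Declare how each column should be compared.
fields = [
    {"field": "name", "type": "String"},
    {"field": "address", "type": "String"},
]

deduper = dedupe.Dedupe(fields)
deduper.prepare_training(data)
dedupe.console_label(deduper)  # interactively label a few candidate pairs in the terminal
deduper.train()

# partition() groups records it believes are duplicates of each other,
# with a confidence score for each record in the cluster.
for record_ids, scores in deduper.partition(data, threshold=0.5):
    print(record_ids, scores)
```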
There is now a package that makes it easier to use the dedupe library with pandas: pandas-dedupe.
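A minimal sketch of that workflow (the input file and column names are hypothetical; the call opens an interactive console session where you label a handful of candidate pairs, and the "cluster id" / "confidence" output columns follow the package's README and may differ by version):

```python
import pandas as pd
import pandas_dedupe

df = pd.read_csv("records.csv")  # hypothetical input

# Train on the named columns and get back the frame with added
# "cluster id" and "confidence" columns.
deduped = pandas_dedupe.dedupe_dataframe(df, ["name", "address"])

# Rows sharing a cluster id are the fuzzy duplicates.
dupes = deduped[deduped.duplicated("cluster id", keep=False)]
print(dupes.sort_values("cluster id"))
```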
(I am a developer of the original dedupe library, but not the pandas-dedupe package)