I'm trying to fuzzy match two csv files, each containing one column of names, that are similar but not the same.
My code so far is as follows:
import pandas as pd
from pandas import DataFrame
from fuzzywuzzy import process
import csv
save_file = open('fuzzy_match_results.csv', 'w')
writer = csv.writer(save_file, lineterminator = '\n')
def parse_csv(path):
with open(path,'r') as f:
reader = csv.reader(f, delimiter=',')
for row in reader:
yield row
if __name__ == "__main__":
## Create lookup dictionary by parsing the products csv
data = {}
for row in parse_csv('names_1.csv'):
data[row[0]] = row[0]
## For each row in the lookup compute the partial ratio
for row in parse_csv("names_2.csv"):
#print(process.extract(row,data, limit = 100))
for found, score, matchrow in process.extract(row, data, limit=100):
if score >= 60:
print('%d%% partial match: "%s" with "%s" ' % (score, row, found))
Digi_Results = [row, score, found]
writer.writerow(Digi_Results)
save_file.close()
The output is as follows:
Name11 , 90 , Name25
Name11 , 85 , Name24
Name11 , 65 , Name29
The script works fine. The output is as expected. But what I am looking for is only the best match.
Name11 , 90 , Name25
Name12 , 95 , Name21
Name13 , 98 , Name22
So I need to somehow drop the duplicated names in column 1, based on the highest value in column 2. It should be fairly straightforward, but I can't seem to figure it out. Any help would be appreciated.
fuzzywuzzy's
process.extract()
returns the list in reverse sorted order , with the best match coming first.so to find just the best match, you can set the limit argument as
1
, so that it only returns the best match, and if that is greater than 60 , you can write it to the csv, like you are doing now.Example -
I just wrote the same thing for myself but in pandas....
I have used a threshold of 50 in this - but it is configurable.
Dataframe1 looks like
And Dataframe2 looks like
So running it produces the matches of
Hope this helps.
Several pieces of your code can be greatly simplified by using
process.extractOne()
from FuzzyWuzzy. Not only does it just return the top match, you can set a score threshold for it within the function call, rather than needing to perform a separate logical step, e.g.:This function will return a tuple of the highest match plus the accompanying score if it finds a match satisfying the condition. It will return
None
otherwise.