I'm checking if there are similar results (fuzzy match) in 4 same dataframe columns, and I have the following code, as an example. When I apply it to the real 40.000 rows x 4 columns dataset, keeps running in eternum. The issue is that the code is too slow. For example, if I limite the dataset to 10 users, it takes 8 minutes to compute, while for 20, 19 minutes. Is there anything I am missing? I do not know why this take that long. I expect to have all results, maximum in 2 hours or less. Any hint or help would be greatly appreciated.
from fuzzywuzzy import process
dataframecolumn = ["apple","tb"]
compare = ["adfad","apple","asple","tab"]
Ratios = [process.extract(x,compare) for x in dataframecolumn]
result = list()
for ratio in Ratios:
for match in ratio:
if match[1] != 100:
result.append(match)
break
print (result)
Output:
[('asple', 80), ('tab', 80)]
Major speed improvements come by writing vectorized operations and avoiding loops
Importing necessary package
from fuzzywuzzy import fuzz
import pandas as pd
import numpy as np
Creating dataframe from first list
dataframecolumn = pd.DataFrame(["apple","tb"])
dataframecolumn.columns = ['Match']
Creating dataframe from second list
compare = pd.DataFrame(["adfad","apple","asple","tab"])
compare.columns = ['compare']
Merge - Cartesian product by introducing key(self join)
dataframecolumn['Key'] = 1
compare['Key'] = 1
combined_dataframe = dataframecolumn.merge(compare,on="Key",how="left")
combined_dataframe = combined_dataframe[~(combined_dataframe.Match==combined_dataframe.compare)]
Vectorization
def partial_match(x,y):
return(fuzz.ratio(x,y))
partial_match_vector = np.vectorize(partial_match)
Using vectorization and getting desired result by putting threshold on score
combined_dataframe['score']=partial_match_vector(combined_dataframe['Match'],combined_dataframe['compare'])
combined_dataframe = combined_dataframe[combined_dataframe.score>=80]
Results
+--------+-----+--------+------+
| Match | Key | compare | score
+--------+-----+--------+------+
| apple | 1 | asple | 80
| tb | 1 | tab | 80
+--------+-----+--------+------+