Python Fuzzy matching strings in list performance

2020-07-22 18:07发布

问题:

I'm checking if there are similar results (fuzzy match) in 4 same dataframe columns, and I have the following code, as an example. When I apply it to the real 40.000 rows x 4 columns dataset, keeps running in eternum. The issue is that the code is too slow. For example, if I limite the dataset to 10 users, it takes 8 minutes to compute, while for 20, 19 minutes. Is there anything I am missing? I do not know why this take that long. I expect to have all results, maximum in 2 hours or less. Any hint or help would be greatly appreciated.

from fuzzywuzzy import process
dataframecolumn = ["apple","tb"]
compare = ["adfad","apple","asple","tab"]
Ratios = [process.extract(x,compare) for x in dataframecolumn]
result = list()
for ratio in Ratios:
    for match in ratio:
        if match[1] != 100:
            result.append(match)
            break
print (result) 

Output: [('asple', 80), ('tab', 80)]

回答1:

Major speed improvements come by writing vectorized operations and avoiding loops

Importing necessary package

from fuzzywuzzy import fuzz
import pandas as pd
import numpy as np

Creating dataframe from first list

dataframecolumn = pd.DataFrame(["apple","tb"])
dataframecolumn.columns = ['Match']

Creating dataframe from second list

compare = pd.DataFrame(["adfad","apple","asple","tab"])
compare.columns = ['compare']

Merge - Cartesian product by introducing key(self join)

dataframecolumn['Key'] = 1
compare['Key'] = 1
combined_dataframe = dataframecolumn.merge(compare,on="Key",how="left")
combined_dataframe = combined_dataframe[~(combined_dataframe.Match==combined_dataframe.compare)]

Vectorization

def partial_match(x,y):
    return(fuzz.ratio(x,y))
partial_match_vector = np.vectorize(partial_match)

Using vectorization and getting desired result by putting threshold on score

combined_dataframe['score']=partial_match_vector(combined_dataframe['Match'],combined_dataframe['compare'])
combined_dataframe = combined_dataframe[combined_dataframe.score>=80]

Results

+--------+-----+--------+------+
| Match  | Key | compare | score
+--------+-----+--------+------+
| apple  | 1   |   asple |    80
|  tb    | 1   |   tab   |    80
+--------+-----+--------+------+