I have some data containing spelling errors. I'm correcting them and scoring how close the spelling is using the following code:
import pandas as pd
import difflib
Li_A = ["potato", "tomato", "squash", "apple", "pear"]
Q = {'one' : pd.Series(["potat0", "toma3o", "s5uash", "ap8le", "pea7"], index=['a', 'b', 'c', 'd', 'e']),
'two' : pd.Series(["po1ato", "2omato", "squ0sh", "2pple", "p3ar"], index=['a', 'b', 'c', 'd', 'e'])}
df_Q = pd.DataFrame(Q)
# Define the function that Corrects & Scores the Spelling
def Spelling(ask):
    a = difflib.get_close_matches(ask, Li_A, n=5, cutoff=0.1)
    # List comprehension for all values of a
    b = [difflib.SequenceMatcher(None, ask, x).ratio() for x in a]
    return pd.Series(a + b)
# Apply the function that Corrects & Scores the Spelling
df_A = df_Q['one'].apply(Spelling)
# Get the column names on the A dataframe
c = len(df_A.columns) // 2
df_A.columns = ['Spelling_{}'.format(x) for x in range(c)] + \
['Score_{}'.format(y) for y in range(c)]
# Join the Q & A dataframes
df_QA = df_Q.join(df_A)
This gives the result:
df_QA
      one     two Spelling_0 Spelling_1 Spelling_2 Spelling_3 Spelling_4  \
a  potat0  po1ato     potato     tomato       pear      apple     squash
b  toma3o  2omato     tomato     potato       pear      apple     squash
c  s5uash  squ0sh     squash       pear      apple     tomato     potato
d   ap8le   2pple      apple       pear     tomato     squash     potato
e    pea7    p3ar       pear     potato      apple     tomato     squash

    Score_0   Score_1   Score_2   Score_3   Score_4
a  0.833333  0.500000  0.400000  0.181818  0.166667
b  0.833333  0.333333  0.200000  0.181818  0.166667
c  0.833333  0.200000  0.181818  0.166667  0.166667
d  0.800000  0.222222  0.181818  0.181818  0.181818
e  0.750000  0.400000  0.444444  0.200000  0.200000
For row "e", "potato" is in column Spelling_1 and "apple" is in column Spelling_2. However, "apple" got a higher score (0.444444) than "potato" (0.400000). This is the wrong way round for my application.
How do I get the higher-scoring results to be consistently to the left, please?
Edit 1: I tried some simpler code:
import difflib
Li_A = ["potato", "tomato", "squash", "apple", "pear"]
Q = "pea7"
A = difflib.get_close_matches(Q, Li_A, n=5, cutoff=0.1)
& got the same result:
A: ['pear', 'potato', 'apple', 'tomato', 'squash']
I also tried a simpler scoring code:
import difflib
S1 = difflib.SequenceMatcher(None, "pea7", "potato")
R1 = S1.ratio()
S2 = difflib.SequenceMatcher(None, "pea7", "apple")
R2 = S2.ratio()
& again I got the same result:
R1: 0.4
R2: 0.444
Edit 2: I tried it with fuzzywuzzy and got the same result again, since fuzzywuzzy depends on difflib:
from fuzzywuzzy import fuzz
R1 = fuzz.ratio("pea7", "potato")
R2 = fuzz.ratio("pea7", "apple")
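SequenceMatcher is correctly calculating the ratio using the method described by Ratcliff and Metzener, 1988. That is, for the number of characters found in common (CC) and the total number of characters in the two strings (CT), ratio = 2 * CC / CT. For "pea7" and "potato", CC = 2 and CT = 10, giving 0.4; for "pea7" and "apple", CC = 2 and CT = 9, giving about 0.444.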
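As a quick check of that calculation, here is a minimal sketch that recomputes the ratio by hand as 2 * CC / CT, taking CC from the sizes of the blocks returned by SequenceMatcher.get_matching_blocks(). The helper name ratcliff_ratio is my own, not part of difflib.

```python
import difflib


def ratcliff_ratio(first, second):
    """Hypothetical helper: recompute the ratio as 2 * CC / CT."""
    matcher = difflib.SequenceMatcher(None, first, second)
    # CC: total characters in the matching blocks found by difflib
    cc = sum(block.size for block in matcher.get_matching_blocks())
    # CT: total number of characters in both strings
    ct = len(first) + len(second)
    return 2.0 * cc / ct if ct else 1.0


print(ratcliff_ratio("pea7", "potato"))  # 0.4
print(ratcliff_ratio("pea7", "apple"))   # about 0.444
```

Both values agree with what SequenceMatcher.ratio() reports for the same argument order.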
So it looks like the issue is with get_close_matches.
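One possible explanation, which is an assumption from my reading of the CPython difflib source and not something the documentation promises: get_close_matches seems to pass each candidate as the first sequence and the query word as the second, while my scoring code passes them the other way round, and ratio() is not symmetric in its arguments. The sketch below scores with the candidate first and sorts on that same score, so the names and the scores cannot disagree. ranked_matches is a hypothetical helper of my own.

```python
import difflib

candidates = ["potato", "tomato", "squash", "apple", "pear"]


def ranked_matches(word, possibilities, n=5, cutoff=0.1):
    """Hypothetical helper: return (candidate, score) pairs, best score first."""
    scored = []
    for candidate in possibilities:
        # Candidate first, query second: assumed to mirror get_close_matches
        score = difflib.SequenceMatcher(None, candidate, word).ratio()
        if score >= cutoff:
            scored.append((score, candidate))
    # Sort on the very score that is returned, highest first
    scored.sort(reverse=True)
    return [(candidate, score) for score, candidate in scored[:n]]


# The same pair of strings scores differently depending on argument order
forward = difflib.SequenceMatcher(None, "pea7", "apple").ratio()  # about 0.444
reverse = difflib.SequenceMatcher(None, "apple", "pea7").ratio()  # about 0.222

matches = ranked_matches("pea7", candidates)
for candidate, score in matches:
    print(candidate, round(score, 3))
```

With this ordering, "potato" (0.4) ranks above "apple" (about 0.222) for "pea7". Which argument order is the right one for a given application is a separate choice; the point is only to use the same order for ranking and for the reported score.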