I have some data containing spelling errors. I'm correcting them and scoring how close the spelling is using the following code:
import pandas as pd
import difflib
Li_A = ["potato", "tomato", "squash", "apple", "pear"]
Q = {'one': pd.Series(["potat0", "toma3o", "s5uash", "ap8le", "pea7"], index=['a', 'b', 'c', 'd', 'e']),
     'two': pd.Series(["po1ato", "2omato", "squ0sh", "2pple", "p3ar"], index=['a', 'b', 'c', 'd', 'e'])}
df_Q = pd.DataFrame(Q)
# Define the function that Corrects & Scores the Spelling
def Spelling(ask):
    a = difflib.get_close_matches(ask, Li_A, n=5, cutoff=0.1)
    # List comprehension for all values of a
    b = [difflib.SequenceMatcher(None, ask, x).ratio() for x in a]
    return pd.Series(a + b)
# Apply the function that Corrects & Scores the Spelling
df_A = df_Q['one'].apply(Spelling)
# Get the column names on the A dataframe
c = len(df_A.columns) // 2
df_A.columns = ['Spelling_{}'.format(x) for x in range(c)] + \
               ['Score_{}'.format(y) for y in range(c)]
# Join the Q & A dataframes
df_QA = df_Q.join(df_A)
This gives the result:
df_QA
      one     two Spelling_0 Spelling_1 Spelling_2 Spelling_3 Spelling_4  \
a  potat0  po1ato     potato     tomato       pear      apple     squash
b  toma3o  2omato     tomato     potato       pear      apple     squash
c  s5uash  squ0sh     squash       pear      apple     tomato     potato
d   ap8le   2pple      apple       pear     tomato     squash     potato
e    pea7    p3ar       pear     potato      apple     tomato     squash

    Score_0   Score_1   Score_2   Score_3   Score_4
a  0.833333  0.500000  0.400000  0.181818  0.166667
b  0.833333  0.333333  0.200000  0.181818  0.166667
c  0.833333  0.200000  0.181818  0.166667  0.166667
d  0.800000  0.222222  0.181818  0.181818  0.181818
e  0.750000  0.400000  0.444444  0.200000  0.200000
For row "e", "potato" is in column Spelling_1 and "apple" is in column Spelling_2. However, "apple" got a higher score (0.444444) than "potato" (0.400000). This is the wrong way round for my application.
How do I get the higher-scoring results to be consistently to the left, please?
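A minimal sketch of one way to keep the order and the scores consistent: score every candidate once with `SequenceMatcher`, then sort by that same score, instead of relying on the order returned by `get_close_matches`. The helper name `rank_matches` is hypothetical, and the argument order (misspelled word first) is an assumption chosen to match the `Spelling` function above. Unlike `get_close_matches`, this sketch applies no `n` or `cutoff` filtering.

```python
import difflib

def rank_matches(word, candidates):
    """Return (candidate, score) pairs sorted by descending score."""
    # Hypothetical helper: the score uses the same argument order as Spelling()
    scored = [
        (cand, difflib.SequenceMatcher(None, word, cand).ratio())
        for cand in candidates
    ]
    # sorted() is stable, so tied candidates keep their original order
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

Li_A = ["potato", "tomato", "squash", "apple", "pear"]
ranked = rank_matches("pea7", Li_A)
names = [name for name, _ in ranked]
scores = [score for _, score in ranked]
print(names)
print(scores)
```

Because the names and the scores come from the same sorted list, `pd.Series(names + scores)` could be returned from `Spelling` in place of `a + b`.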
Edit 1: I tried a simpler code:
import difflib
Li_A = ["potato", "tomato", "squash", "apple", "pear"]
Q = "pea7"
A = difflib.get_close_matches(Q, Li_A, n=5, cutoff=0.1)
and got the same ordering:
A: ['pear', 'potato', 'apple', 'tomato', 'squash']
I also tried a simpler scoring code:
import difflib
S1 = difflib.SequenceMatcher(None, "pea7", "potato")
R1 = S1.ratio()
S2 = difflib.SequenceMatcher(None, "pea7", "apple")
R2 = S2.ratio()
and again I got the same scores:
R1: 0.4
R2: 0.444
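The ordering and the scores can disagree because `SequenceMatcher.ratio()` is not symmetric: the difflib documentation cautions that the result may depend on the order of the arguments. As far as I can tell from the CPython source, `get_close_matches` sets each candidate as the first sequence and the query word as the second, which is the reverse of the calls above. A small check, as a sketch, that compares both argument orders:

```python
import difflib

word = "pea7"
results = {}
for cand in ["potato", "apple"]:
    # Same order as the scoring calls above: misspelled word first
    word_first = difflib.SequenceMatcher(None, word, cand).ratio()
    # Reversed order: candidate first, misspelled word second
    cand_first = difflib.SequenceMatcher(None, cand, word).ratio()
    results[cand] = (word_first, cand_first)
    print(cand, round(word_first, 3), round(cand_first, 3))
```

If the two numbers printed for a candidate differ, the ranking and the reported score were computed from different comparisons.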
Edit 2: I tried it with fuzzywuzzy and got the same result again, since fuzzywuzzy depends on difflib:
from fuzzywuzzy import fuzz
R1 = fuzz.ratio("pea7", "potato")
R2 = fuzz.ratio("pea7", "apple")