I have a DataFrame that looks like this:
import pandas as pd

df = pd.DataFrame({'Ex1': ['apple', 'apple1', 'Peear', 'peAr', 'b$nana', 'Bananas'],
                   'Ex2': ['Applet', 'banan', 'apples', 'PAIR', 'banana', 'apple'],
                   'Ex3': ['Pears', 'Banaa', 'Apple', 'apple1', 'pear', 'abanana']}); df
And then I have three lists, each holding the known misspellings of one canonical fruit name:
apple = ['apple1','Applet','apples','Apple']
pear = ['Peear','peAr','PAIR','Pears','p3ar']
banana = ['b$nana','Bananas','banan','Banaa','abanana']
How can I iterate over each of the columns and change the misspelled fruits into the correct ones? That is, the final data frame should look like this:
      Ex1     Ex2     Ex3
0   apple   apple    pear
1   apple  banana  banana
2    pear   apple   apple
3    pear    pear   apple
4  banana  banana    pear
5  banana   apple  banana
I know I could achieve this outcome with the following code:
replacements = {
    'apple1': 'apple',
    'Applet': 'apple',
    ...}
df['Ex1'] = df['Ex1'].replace(replacements)
But I have 1000+ rows and I don't want to go through and write out each entry in replacements by hand, because that would take a lot of time.
Any suggestions for doing this in a way that lets me use my apple, pear, and banana variables as-is?
The simple (perhaps even simplistic) approach involving the handwritten lists of misspellings can be automated merely by constructing the dictionary from the lists:
repl = {s: n for n, l in [("apple", apple), ("pear", pear), ("banana", banana)]
        for s in l}
The list of correct names and misspellings for each can itself be constructed automatically if they reside in some data structure like a containing dictionary. (It’s possible to use globals() or locals() as that dictionary, but then you have to filter out the extraneous entries.)
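As a rough sketch of the containing-dictionary idea, assuming the three lists from the question are moved into one dict (the names misspellings, column and fixed are made up for this illustration), the replacement mapping falls out of a single comprehension:

```python
# Hypothetical container: canonical name -> known misspellings (lists from the question)
misspellings = {
    'apple': ['apple1', 'Applet', 'apples', 'Apple'],
    'pear': ['Peear', 'peAr', 'PAIR', 'Pears', 'p3ar'],
    'banana': ['b$nana', 'Bananas', 'banan', 'Banaa', 'abanana'],
}

# Invert it: every misspelling becomes a key pointing at its canonical name
repl = {wrong: right
        for right, variants in misspellings.items()
        for wrong in variants}

# Plain-list demo on the Ex1 values; unknown or already-correct values pass through
column = ['apple', 'apple1', 'Peear', 'peAr', 'b$nana', 'Bananas']
fixed = [repl.get(value, value) for value in column]
print(fixed)  # ['apple', 'apple', 'pear', 'pear', 'banana', 'banana']
```

With the question's frame, df.replace(repl) should then apply the same mapping to every column at once, since a flat dict passed to DataFrame.replace is treated as value-to-replacement pairs.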
A more flexible approach, which needs no handwritten lists of misspellings, is to compute a similarity ratio between each misspelled word and each correctly spelled word. Of the Python libraries that offer this, I used Levenshtein, whose ratio function returns a similarity score between 0 and 1. Getting the ratio is simple, for example:
from Levenshtein import ratio
ratio('banana', 'Banaa')
#0.7272727272727273
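To see where that number comes from, here is a pure-Python sketch. It assumes (and this is my understanding of the library, not something taken from its source) that ratio is a normalized insert/delete similarity, i.e. twice the longest common subsequence length divided by the combined length of the two strings. The helper names lcs_length and indel_ratio are made up for this illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_ratio(a, b):
    """Assumed definition: 2 * LCS / (len(a) + len(b)); case-sensitive."""
    total = len(a) + len(b)
    if total == 0:
        return 1.0
    return 2 * lcs_length(a, b) / total


# 'banana' and 'Banaa' share the subsequence 'anaa' (4 characters): 2 * 4 / 11
print(indel_ratio('banana', 'Banaa'))  # 0.7272727272727273
# No character matches when the case differs everywhere
print(indel_ratio('PAIR', 'pear'))     # 0.0
```

The comparison is case-sensitive, which is why the leading 'B' in 'Banaa' does not count as a match for 'b'.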
Now, given the following list of correct words, correct_words, the ratio is computed between each word in the data and each word in correct_words.
correct_words = ['apple', 'pear', 'banana']
This means each element gets three ratio values. However, we only care about the maximum ratio and the correct word associated with it. The similarity function below builds an intermediate dictionary with the correct words as keys and the ratios as values, and returns the key with the largest value. Finally, we map each element of the dataframe through that function.
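from Levenshtein import ratio
import operator

def similarity(x):
    # Score x against every correct word, then return the best-scoring word
    l = {}
    for i in correct_words:
        l[i] = ratio(x, i)
    return max(l.items(), key=operator.itemgetter(1))[0]

# On pandas 2.1+, DataFrame.map is the non-deprecated name for applymap
df.applymap(similarity)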
      Ex1     Ex2     Ex3
0   apple   apple    pear
1   apple  banana  banana
2    pear   apple   apple
3    pear   apple   apple
4  banana  banana    pear
5  banana   apple  banana
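Note that row 3 of Ex2 comes out as apple rather than pear: the comparison is case-sensitive, so 'PAIR' scores 0 against all three correct words and the first entry wins the tie. One possible fix is to lowercase each value before scoring. The sketch below uses difflib.SequenceMatcher from the standard library as a stand-in for Levenshtein.ratio so it runs without extra packages; its ratio is computed differently, so the scores need not match exactly. The name similarity_ci is made up for this illustration.

```python
from difflib import SequenceMatcher

correct_words = ['apple', 'pear', 'banana']


def similarity_ci(word):
    """Return the entry of correct_words most similar to word, ignoring case."""
    lowered = word.lower()
    # max() keeps the first best-scoring candidate if several tie
    return max(correct_words,
               key=lambda candidate: SequenceMatcher(None, lowered, candidate).ratio())


print(similarity_ci('PAIR'))    # pear
print(similarity_ci('b$nana'))  # banana
```

It would be applied the same way as before, e.g. df.applymap(similarity_ci). Keep in mind that any approach of this kind always returns some fruit, even for a value that resembles none of them, so a minimum-score cutoff may be worth adding for real data.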