How to replace string in python from a list of pos

Posted 2020-06-27 18:17

Question:

I have a dataframe that looks like this:

import pandas as pd

df = pd.DataFrame({'Ex1': ['apple', 'apple1', 'Peear', 'peAr', 'b$nana', 'Bananas'],
                   'Ex2': ['Applet', 'banan', 'apples', 'PAIR', 'banana', 'apple'],
                   'Ex3': ['Pears', 'Banaa', 'Apple', 'apple1', 'pear', 'abanana']})
df

And then I have three lists, each holding the known misspellings of one canonical fruit name:

apple = ['apple1','Applet','apples','Apple']
pear = ['Peear','peAr','PAIR','Pears','p3ar']
banana = ['b$nana','Bananas','banan','Banaa','abanana']

How can I iterate over each of the columns to change the misspelled fruit names into the correct ones? That is, the final dataframe should look like this:

    Ex1     Ex2     Ex3
0   apple   apple   pear
1   apple   banana  banana
2   pear    apple   apple
3   pear    pear    apple
4   banana  banana  pear
5   banana  apple   banana

I know I could achieve this outcome with the following code:

replacements = {
    'apple1': 'apple',
    'Applet': 'apple',
...}

df['Ex1'] = df['Ex1'].replace(replacements)

But I have 1000+ rows, and I don't want to go through and write out each replacement in replacements by hand because that would take a lot of time.

Any suggestions for doing this in a way that I can use my apple, pear, and banana variables as-is?

Answer 1:

The simple (perhaps even simplistic) approach involving the handwritten lists of misspellings can be automated merely by constructing the dictionary from the lists:

repl = {s: n for n, l in [("apple", apple), ("pear", pear), ("banana", banana)]
        for s in l}

The list of correct names and misspellings for each can itself be constructed automatically if they reside in some data structure like a containing dictionary. (It’s possible to use globals() or locals() as that dictionary, but then you have to filter out the extraneous entries.)
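One way to sketch the "containing dictionary" idea: keep each list of misspellings under its canonical name, then invert that dictionary into the flat lookup table. The name `misspellings` is hypothetical (it does not appear in the question); the lists themselves are the ones from the question.

```python
# Hypothetical container: canonical name -> known misspellings (lists from the question).
misspellings = {
    "apple": ["apple1", "Applet", "apples", "Apple"],
    "pear": ["Peear", "peAr", "PAIR", "Pears", "p3ar"],
    "banana": ["b$nana", "Bananas", "banan", "Banaa", "abanana"],
}

# Invert it: each misspelling becomes a key that points to its canonical name.
repl = {wrong: right for right, variants in misspellings.items() for wrong in variants}

# Plain-Python check on the Ex2 values from the question;
# values missing from repl (already correct) are left unchanged.
ex2 = ["Applet", "banan", "apples", "PAIR", "banana", "apple"]
fixed = [repl.get(value, value) for value in ex2]
print(fixed)  # ['apple', 'banana', 'apple', 'pear', 'banana', 'apple']
```

With pandas, the same dictionary can be passed to `df.replace(repl)`, which returns a new dataframe in which every cell that exactly matches a key is replaced, across all columns.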



Answer 2:

A more general solution is to compute a similarity ratio between each misspelled word and each correctly spelled word, so that no handwritten list of misspellings is needed. Among the few libraries available in Python for this, I used the Levenshtein library, whose ratio function returns the similarity ratio. Getting the ratio is simple, for example:

from Levenshtein import ratio
ratio('banana', 'Banaa')
#0.7272727272727273
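To give a feel for what that number means, here is a pure-Python sketch under the assumption that ratio is a normalized insert/delete similarity, that is, twice the length of the longest common subsequence divided by the combined length of the two strings. The helpers `lcs_length` and `indel_ratio` are hypothetical names, not part of the Levenshtein library; the sketch reproduces the value shown above for this pair.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the common subsequence ending just before both characters.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise keep the best result from skipping one character.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_ratio(a, b):
    """Hypothetical stand-in for ratio: 2 * LCS / (len(a) + len(b))."""
    total = len(a) + len(b)
    return 1.0 if total == 0 else 2 * lcs_length(a, b) / total


# 'banana' and 'Banaa' share the subsequence 'anaa' (length 4): 2 * 4 / 11, about 0.727.
print(indel_ratio("banana", "Banaa"))
# Identical strings score 1.0.
print(indel_ratio("apple", "apple"))
```

Note that the capital 'B' does not match the lowercase 'b', so the comparison is case-sensitive.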

Now, given the following list of correct words, correct_words, the ratio is computed between each value in the dataframe and each word in correct_words.

correct_words = ['apple', 'pear', 'banana']

This means each value gets three ratios. We only care about the maximum ratio and the correct word associated with it. The similarity function below builds an intermediate dictionary that maps each correct word (the key) to its ratio, then returns the key with the largest value. Finally, we apply that function to every element of the dataframe.

from Levenshtein import ratio
import operator

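def similarity(x):
    # Score x against every correct word (the comparison is case-sensitive).
    scores = {}
    for word in correct_words:
        scores[word] = ratio(x, word)
    # Return the correct word with the highest ratio.
    return max(scores.items(), key=operator.itemgetter(1))[0]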


df.applymap(similarity)
    Ex1     Ex2     Ex3
0   apple   apple   pear
1   apple   banana  banana
2   pear    apple   apple
3   pear    apple   apple
4   banana  banana  pear
5   banana  apple   banana
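Note that row 3 of Ex2 ('PAIR') comes out as apple rather than pear. The comparison is case-sensitive, so 'PAIR' shares no characters with any of the lowercase names, all three ratios are 0, and max returns the first key. A minimal sketch of one way around this is to lowercase the value first and to leave it unchanged when nothing matches well. It swaps in `difflib.SequenceMatcher` from the standard library instead of the Levenshtein package, so the scores are not identical to the ones above; `closest_word` and the 0.5 cutoff are assumptions for illustration.

```python
from difflib import SequenceMatcher

correct_words = ["apple", "pear", "banana"]


def closest_word(value, candidates=correct_words, cutoff=0.5):
    """Return the candidate most similar to value, ignoring case.

    Hypothetical helper: falls back to the original value when no
    candidate scores at least `cutoff` (an arbitrary threshold).
    """
    lowered = value.lower()
    scores = {word: SequenceMatcher(None, lowered, word).ratio() for word in candidates}
    best = max(scores, key=scores.get)
    return best if scores[best] >= cutoff else value


print(closest_word("PAIR"))  # pear
print(closest_word("kiwi"))  # kiwi (no candidate is close enough, so it is kept as-is)
```

To apply it to every cell, pass it to `df.applymap(closest_word)` in the same way as above; in pandas 2.1 and later the same method is available as `DataFrame.map`. The cutoff would need tuning on real data, since a value that is too low lets unrelated words through and one that is too high leaves real misspellings uncorrected.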