How to un-stem a word in Python?

Posted 2020-07-08 06:47

Question:

I want to know if there is any way to un-stem words back to a normal form.

The problem is that I have thousands of words in different forms (e.g. eat, eaten, ate, eating) and I need to count the frequency of each word. All of these forms (eat, eaten, ate, eating, etc.) should count towards eat, and hence I used stemming.

But the next part of the problem requires me to find similar words in the data, and I am using nltk's synsets to calculate the Wu-Palmer Similarity among the words. The problem is that nltk's synsets won't work on stemmed words, or at least they won't in this code: check if two words are related to each other

How should I do it? Is there a way to un-stem a word?

Answer 1:

I suspect what you really mean by "stem" is "tense", as in you want the different tenses of each verb to count towards its "base form".

Check out the pattern package:

pip install pattern

Then use the en.lemma function to return a verb's base form.

import pattern.en as en
base_form = en.lemma('ate') # base_form == "eat"
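For the frequency-counting part of the question, here is a minimal sketch (assuming pattern is installed) that tallies counts by base form:

# Count word frequencies by base form, so that eat, eaten, ate and
# eating all count towards "eat".
from collections import Counter
from pattern.en import lemma

words = ['eat', 'eaten', 'ate', 'eating']
counts = Counter(lemma(w) for w in words)
print(counts)  # Counter({'eat': 4})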


Answer 2:

No, there isn't. With stemming, you lose information, not only about the word form (as in eat vs. eats or eaten), but also about the word itself (as in tradition vs. traditional). Unless you use a prediction method to try to recover this information from the context of the word, there is no way to get it back.
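To see the loss concretely, here is a quick check with nltk's PorterStemmer (the word pair is from the answer above; the snippet itself is only illustrative):

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem('tradition'))    # 'tradit'
print(stemmer.stem('traditional'))  # 'tradit'
# Both collapse to 'tradit'; nothing in the stem tells you which word it was.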



Answer 3:

I think an OK approach is the one described in https://stackoverflow.com/a/30670993/7127519.

A possible implementation could be something like this:

import nltk
import pandas as pd

stemmer = nltk.stem.porter.PorterStemmer()

A stemmer to use. Here is a text to work with:

complete_text = ''' cats catlike catty cat 
stemmer stemming stemmed stem 
fishing fished fisher fish 
argue argued argues arguing argus argu 
argument arguments argument '''

Create a list with the different words:

# Collect the unique, lowercased words in order of appearance.
my_list = []
for i in complete_text.split():
    word = i.lower()
    if word not in my_list:
        my_list.append(word)
my_list

with output:

['cats',
 'catlike',
 'catty',
 'cat',
 'stemmer',
 'stemming',
 'stemmed',
 'stem',
 'fishing',
 'fished',
 'fisher',
 'fish',
 'argue',
 'argued',
 'argues',
 'arguing',
 'argus',
 'argu',
 'argument',
 'arguments']

And now create the dictionary:

# Map each word to its stem, then join all words that share a stem.
aux = pd.DataFrame(my_list, columns=['word'])
aux['word_stemmed'] = aux['word'].apply(lambda x: stemmer.stem(x))
aux = aux.groupby('word_stemmed').transform(lambda x: ', '.join(x))
# Recover each group's stem from the first word of the joined string
# and use it as the index.
aux['word_stemmed'] = aux['word'].apply(lambda x: stemmer.stem(x.split(',')[0]))
aux.index = aux['word_stemmed']
del aux['word_stemmed']
my_dict = aux.to_dict('dict')['word']
my_dict

The output is:

{'argu': 'argue, argued, argues, arguing, argus, argu',
 'argument': 'argument, arguments',
 'cat': 'cats, cat',
 'catlik': 'catlike',
 'catti': 'catty',
 'fish': 'fishing, fished, fish',
 'fisher': 'fisher',
 'stem': 'stemming, stemmed, stem',
 'stemmer': 'stemmer'}
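
With this dictionary you can recover all the original forms behind any stemmed word, for example:

# Look up the original forms for a stemmed word, using my_dict and
# the stemmer built above.
print(my_dict[stemmer.stem('fishing')])  # 'fishing, fished, fish'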

Companion notebook here.



Answer 4:

Theoretically, the only way to un-stem is to keep a dictionary of terms, or a mapping of some kind, prior to stemming, and to carry this mapping through the rest of your computations. The mapping should capture the position of each un-stemmed token; when you need to un-stem a token and you know the original position of its stemmed form, you can trace back through the mapping and recover the original un-stemmed representation.

For the Bag-of-Words representation this seems computationally intensive, and it somewhat defeats the purpose of the statistical nature of the BoW approach.

But again, theoretically I believe it could work. I haven't seen it in any implementation, though.
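A minimal sketch of that mapping idea (the names here are hypothetical, not from any library):

# Record each token's original form alongside its stem, keyed by
# position, so the original can be traced back after stemming.
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
tokens = ['eating', 'ate', 'traditional', 'eaten']

# position -> (stemmed form, original form)
mapping = {pos: (stemmer.stem(tok), tok) for pos, tok in enumerate(tokens)}
stemmed_tokens = [stem for stem, _ in mapping.values()]

def unstem(pos):
    # Recover the original token at a given position.
    return mapping[pos][1]

print(stemmed_tokens)  # the stemmed view used for the computations
print(unstem(2))       # 'traditional'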



Answer 5:

tl;dr: you could use any stemmer you want (e.g. Snowball) and, for each stemmed word, keep track of which original word was the most popular before stemming by counting occurrences.

You may like this open-source project, which uses stemming and contains an algorithm to do inverse stemming:

  • https://github.com/ArtificiAI/Multilingual-Latent-Dirichlet-Allocation-LDA

On this page of the project, there are explanations of how to do the inverse stemming. To sum things up, it works as follows.

First, you stem some documents; here, short French-language strings with their stop words removed, for example: ['sup chat march trottoir', 'sup chat aiment ronron', 'chat ronron', 'sup chien aboi', 'deux sup chien', 'combien chien train aboi']

Then the trick is to have kept a count of the most popular original words for each stemmed word: {'aboi': {'aboie': 1, 'aboyer': 1}, 'aiment': {'aiment': 1}, 'chat': {'chat': 1, 'chats': 2}, 'chien': {'chien': 1, 'chiens': 2}, 'combien': {'Combien': 1}, 'deux': {'Deux': 1}, 'march': {'marche': 1}, 'ronron': {'ronronner': 1, 'ronrons': 1}, 'sup': {'super': 4}, 'train': {'train': 1}, 'trottoir': {'trottoir': 1}}

Finally, you may now guess how to implement this yourself: simply take, for each stemmed word, the original word with the highest count. You can refer to the following implementation, which is available under the MIT License as part of the Multilingual-Latent-Dirichlet-Allocation-LDA project:

  • https://github.com/ArtificiAI/Multilingual-Latent-Dirichlet-Allocation-LDA/blob/master/lda_service/logic/stemmer.py

Improvements could be made by ditching the non-top reverse words (using a heap, for example), which would yield just one dict in the end instead of a dict of dicts.
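
A minimal sketch of the counting idea (not the project's actual code; using nltk's SnowballStemmer is an assumption here):

# Count original forms per stem, then map each stem back to its most
# frequent original word.
from collections import Counter, defaultdict
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer('english')
tokens = ['eating', 'ate', 'eats', 'eating', 'fisher', 'fishing']

# stem -> counts of the original forms seen before stemming
originals = defaultdict(Counter)
for tok in tokens:
    originals[stemmer.stem(tok)][tok] += 1

# stem -> most popular original form
unstem = {stem: counts.most_common(1)[0][0]
          for stem, counts in originals.items()}
print(unstem)  # e.g. {'eat': 'eating', 'ate': 'ate', 'fisher': 'fisher', 'fish': 'fishing'}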



Tags: python nlp nltk