I am using Python 3.5 with NLTK's pos_tag function and the WordNetLemmatizer. My goal is to flatten words in our database in order to classify text. While testing the lemmatizer, I encountered strange behavior when running the POS tagger on identical tokens. In the example below, I have a list of three identical strings; when I run them through the POS tagger, every other element is returned as a noun (NN) and the rest are returned as verbs (VBG).
This affects the lemmatization. The output looks like this:
pos Of token: v
lemmatized token: skydive
pos Of token: n
lemmatized token: skydiving
pos Of token: v
lemmatized token: skydive
If I add more elements to the list of identical strings, the same pattern continues. The full code I am using is this:
from nltk import pos_tag
from nltk.stem import WordNetLemmatizer

tokens = ['skydiving', 'skydiving', 'skydiving']
lmtzr = WordNetLemmatizer()

def get_wordnet_pos(treebank_tag):
    # Map a Penn Treebank tag to the matching WordNet POS letter.
    if treebank_tag.startswith('J'):
        return 'a'
    elif treebank_tag.startswith('V'):
        return 'v'
    elif treebank_tag.startswith('N'):
        return 'n'
    elif treebank_tag.startswith('R'):
        return 'r'
    elif treebank_tag.startswith('S'):
        return ''
    else:
        return ''

# Strip spaces from every token before tagging.
numTokens = len(tokens)
for i in range(0, numTokens):
    tokens[i] = tokens[i].replace(" ", "")

# Tag the whole list as if it were one sentence.
noSpaceTokens = pos_tag(tokens)
for token in noSpaceTokens:
    tokenStr = str(token[1])
    noWhiteSpace = token[0].replace(" ", "")
    preLemmed = get_wordnet_pos(tokenStr)
    print("pos Of token: " + preLemmed)
    lemmed = lmtzr.lemmatize(noWhiteSpace, preLemmed)
    print("lemmatized token: " + lemmed)
In short:
When POS tagging, you need a sentence for context, not a list of ungrammatical tokens.
When lemmatizing a word outside the context of a sentence, the only way to get the right lemma is to specify the POS tag manually, i.e. to pass the pos parameter to the lemmatize function. Otherwise the n (noun) POS is always assumed; see also "WordNetLemmatizer not returning the right lemma unless POS is explicit - Python NLTK".
In long:
A POS tagger usually works on a full sentence, not on individual words. When you try to tag a single word out of context, what you get is the most frequent tag.
Tagging a single word (i.e. a sentence with only one word) always gives the same tag, which can be verified by tagging it repeatedly.
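One way to check this is to tag the same one-word sentence many times and collect the distinct tags. The sketch below is an illustration, not the original snippet: distinct_tags and check_with_nltk are hypothetical helper names, and the tagger is passed in as an argument so that any callable shaped like nltk.pos_tag (a list of tokens in, a list of (token, tag) pairs out) can be used.

```python
def distinct_tags(tagger, word, runs=100):
    """Tag the one-word sentence [word] `runs` times and return the set of tags seen."""
    seen = set()
    for _ in range(runs):
        # tagger returns [(token, tag)]; keep only the tag of the single token
        seen.add(tagger([word])[0][1])
    return seen


def check_with_nltk(word="skydive"):
    """Run the check with NLTK's default tagger (needs nltk and its tagger model)."""
    from nltk import pos_tag  # imported lazily so distinct_tags has no dependency
    return distinct_tags(pos_tag, word)
```

If the tagger is deterministic, check_with_nltk() should return a set containing exactly one tag. Which tag that is depends on the tagger model that is installed.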
Now, since the tag is always 'a' by default when the sentence has only one word, the WordNetLemmatizer will always return skydive.
Now let's look at the lemma of a word in the context of a sentence, where the tagger can see the neighbouring words.
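A minimal sketch of lemmatizing in context could look like the following. The helper names (treebank_to_wordnet, lemmatize_in_context, demo_with_nltk) and the example sentence are assumptions made for illustration, and falling back to 'n' for unmapped tags is a design choice of this sketch that mirrors the lemmatizer's own default.

```python
def treebank_to_wordnet(tag, default="n"):
    """Map a Penn Treebank tag to a WordNet POS letter, falling back to `default`."""
    return {"J": "a", "V": "v", "N": "n", "R": "r"}.get(tag[:1], default)


def lemmatize_in_context(tokens, tagger, lemmatize):
    """Tag the whole sentence at once, then lemmatize each token with its own POS."""
    tagged = tagger(tokens)  # one call, so every token is tagged with its neighbours
    return [lemmatize(word, treebank_to_wordnet(tag)) for word, tag in tagged]


def demo_with_nltk(sentence="They went skydiving yesterday"):
    """Made-up example sentence; requires nltk plus its tagger and WordNet data."""
    from nltk import pos_tag
    from nltk.stem import WordNetLemmatizer
    return lemmatize_in_context(sentence.split(), pos_tag, WordNetLemmatizer().lemmatize)
```

The tagger and the lemmatize function are passed in so the two steps stay visible: the tags come from one pass over the full sentence, and each lemma depends on the tag its token received in that pass.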
So the context of the input list of tokens matters when you call pos_tag. In your example, you had the list ['skydiving', 'skydiving', 'skydiving'], which means the sentence you are POS tagging is ungrammatical. The pos_tag function treats it as a normal sentence and gives the tags VBG, NN, VBG: the first word is a verb, the second a noun and the third a verb. That yields the following lemmas, which is not what you want: skydive, skydiving, skydive.
So if your list of tokens is a valid grammatical sentence, the output might look very different.
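To compare the two situations side by side, one could collect the tags a word receives in the bare list and in a real sentence. This is only a sketch: tags_for and compare_with_nltk are hypothetical helpers, and the grammatical sentence is a made-up example.

```python
def tags_for(word, tokens, tagger):
    """Return the tag of each occurrence of `word` when `tokens` is tagged as one sentence."""
    return [tag for token, tag in tagger(tokens) if token == word]


def compare_with_nltk():
    """Requires nltk and its tagger model; the second sentence is a made-up example."""
    from nltk import pos_tag
    bare = ["skydiving", "skydiving", "skydiving"]
    in_sentence = "She loves skydiving on weekends".split()
    return (tags_for("skydiving", bare, pos_tag),
            tags_for("skydiving", in_sentence, pos_tag))
```

The first element of the returned pair shows how the tagger handles the repeated tokens, and the second shows the tag the same word gets when it has real neighbours.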