The following code prints out leaf:
from nltk.stem.wordnet import WordNetLemmatizer
lem = WordNetLemmatizer()
print(lem.lemmatize('leaves'))
This may or may not be accurate depending on the surrounding context, e.g. "Mary leaves the room" vs. "Dew drops fall from the leaves". How can I tell NLTK to lemmatize words taking the surrounding context into account?
First POS-tag the sentence, then pass each word's tag to the lemmatizer as the additional pos parameter.
from nltk import pos_tag, word_tokenize
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()
def penn2morphy(penntag):
    """ Converts Penn Treebank tags to WordNet. """
    morphy_tag = {'NN':'n', 'JJ':'a',
                  'VB':'v', 'RB':'r'}
    try:
        return morphy_tag[penntag[:2]]
    except KeyError:
        # Default to noun for tags with no entry in the mapping.
        return 'n'
def lemmatize_sent(text):
    # Text input is string, returns lowercased strings.
    return [wnl.lemmatize(word.lower(), pos=penn2morphy(tag))
            for word, tag in pos_tag(word_tokenize(text))]
lemmatize_sent('He is walking to school')
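To see why the tag matters for the two sentences in the question, here is a small sketch of just the tag-conversion step, with no NLTK calls. The tags are hand-assigned Penn Treebank tags (an assumption for illustration, not actual pos_tag output), and to_wordnet_pos and wordnet_pos_for are hypothetical helper names; the first is a variant of penn2morphy that uses dict.get for the noun fallback.

```python
# Hand-assigned Penn Treebank tags (illustrative assumption, not pos_tag output).
tagged_sentences = [
    [("Mary", "NNP"), ("leaves", "VBZ"), ("the", "DT"), ("room", "NN")],
    [("Dew", "NN"), ("drops", "NNS"), ("fall", "VBP"),
     ("from", "IN"), ("the", "DT"), ("leaves", "NNS")],
]

# Same mapping as penn2morphy: first two characters of the Penn tag.
PENN_TO_WORDNET = {"NN": "n", "JJ": "a", "VB": "v", "RB": "r"}


def to_wordnet_pos(penn_tag):
    """Map a Penn Treebank tag to a WordNet POS, defaulting to noun."""
    return PENN_TO_WORDNET.get(penn_tag[:2], "n")


def wordnet_pos_for(word, tagged):
    """Return the WordNet POS for each occurrence of `word` in a tagged sentence."""
    return [to_wordnet_pos(tag) for token, tag in tagged if token == word]


for tagged in tagged_sentences:
    print(wordnet_pos_for("leaves", tagged))
```

With these tags, "leaves" maps to 'v' in the first sentence and to 'n' in the second, so the same surface form reaches wnl.lemmatize with a different pos argument each time.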
For a detailed walkthrough of how and why the POS tag is necessary see
Alternatively, you can use the pywsd tokenizer + lemmatizer, a wrapper around NLTK's WordNetLemmatizer. Install it with:
pip install -U nltk
python -m nltk.downloader popular
pip install -U pywsd
>>> from pywsd.utils import lemmatize_sentence
Warming up PyWSD (takes ~10 secs)... took 9.307677984237671 secs.
>>> text = "Mary leaves the room"
>>> lemmatize_sentence(text)
['mary', 'leave', 'the', 'room']
>>> text = 'Dew drops fall from the leaves'
>>> lemmatize_sentence(text)
['dew', 'drop', 'fall', 'from', 'the', 'leaf']