I am trying to write a keyword extraction program using the Stanford POS tagger and NER. For keyword extraction, I am only interested in proper nouns. Here is the basic approach:
- Clean up the data by removing everything except alphabetic characters
- Remove stopwords
- Stem each word
- Determine POS tag of each word
- If the POS tag is a noun then feed it to the NER
- The NER will then determine if the word is a person, organization or location
Sample code:
import re

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tag import StanfordNERTagger, StanfordPOSTagger

docText = "Jack Frost works for Boeing Company. He manages 5 aircraft and their crew in London"
# Split on runs of non-word characters
words = re.split(r"\W+", docText)
stops = set(stopwords.words("english"))
# Remove stop words and very short tokens from the list
words = [w for w in words if w not in stops and len(w) > 2]
# Stemming
pstem = PorterStemmer()
words = [pstem.stem(w) for w in words]
nounsWeWant = set(['NN', 'NNS', 'NNP', 'NNPS'])
finalWords = []
stn = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz')
stp = StanfordPOSTagger('english-bidirectional-distsim.tagger')
# Lowercase every word not tagged as a noun; keep nouns as they are
for w in words:
    if stp.tag([w.lower()])[0][1] not in nounsWeWant:
        finalWords.append(w.lower())
    else:
        finalWords.append(w)
finalString = " ".join(finalWords)
print(finalString)
tagged = stn.tag(finalWords)
print(tagged)
which gives me:
Jack Frost work Boe Compani manag aircraft crew London
[(u'Jack', u'PERSON'), (u'Frost', u'PERSON'), (u'work', u'O'), (u'Boe', u'O'), (u'Compani', u'O'), (u'manag', u'O'), (u'aircraft', u'O'), (u'crew', u'O'), (u'London', u'LOCATION')]
So clearly, I did not want Boeing to be stemmed, nor Company. I need to stem the words because my input might contain terms like Performing. I have seen that a word like Performing will be picked up by the NER as a proper noun and hence could be categorized as an Organization. Hence, I first stem all the words and convert them to lower case. Then I check whether the POS tag of each word is a noun. If so, I keep it as is. If not, I convert the word to lower case and add it to the final word list that will be passed to the NER.
Any idea on how to avoid stemming proper nouns?
Use the full Stanford CoreNLP pipeline to handle your NLP tool chain. Avoid your own tokenizer, cleaner, POS tagger, etc. Stemmed, lowercased or otherwise altered text will not play well with the NER tool, which expects natural sentences as input.
[out]:
Or to get the json output:
And if you really need a python wrapper, see https://github.com/smilli/py-corenlp
Possibly this is the output you want:
If you want a wrapper that comes with NLTK, then you have to wait just a little longer until this issue is resolved ;P
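To illustrate why the pipeline approach sidesteps the stemming problem, here is a minimal sketch of post-processing CoreNLP-style JSON once the `tokenize,ssplit,pos,lemma,ner` annotators have run. The `SAMPLE_ANNOTATION` dict is hand-written for illustration and is not real tool output; the helper names `extract_entities` and `normalize_tokens` are hypothetical, and using the `lemma` field in place of a Porter stemmer is my own substitution, not something the question asked for.

```python
from itertools import groupby

# Hand-written stand-in for a CoreNLP JSON response. The tags below are
# illustrative assumptions, not actual CoreNLP output.
SAMPLE_ANNOTATION = {
    "sentences": [
        {
            "tokens": [
                {"word": "Jack", "lemma": "Jack", "pos": "NNP", "ner": "PERSON"},
                {"word": "Frost", "lemma": "Frost", "pos": "NNP", "ner": "PERSON"},
                {"word": "works", "lemma": "work", "pos": "VBZ", "ner": "O"},
                {"word": "for", "lemma": "for", "pos": "IN", "ner": "O"},
                {"word": "Boeing", "lemma": "Boeing", "pos": "NNP", "ner": "ORGANIZATION"},
                {"word": "Company", "lemma": "Company", "pos": "NNP", "ner": "ORGANIZATION"},
                {"word": ".", "lemma": ".", "pos": ".", "ner": "O"},
            ]
        }
    ]
}

PROPER_NOUN_TAGS = {"NNP", "NNPS"}


def extract_entities(annotation):
    """Group consecutive tokens that share a non-'O' NER label."""
    entities = []
    for sentence in annotation["sentences"]:
        for label, group in groupby(sentence["tokens"], key=lambda t: t["ner"]):
            if label != "O":
                entities.append((" ".join(t["word"] for t in group), label))
    return entities


def normalize_tokens(annotation):
    """Keep proper nouns as written; reduce other words to lowercase lemmas."""
    normalized = []
    for sentence in annotation["sentences"]:
        for token in sentence["tokens"]:
            if token["pos"] in PROPER_NOUN_TAGS:
                normalized.append(token["word"])
            elif token["word"].isalpha():  # skip punctuation and numbers
                normalized.append(token["lemma"].lower())
    return normalized


print(extract_entities(SAMPLE_ANNOTATION))
print(normalize_tokens(SAMPLE_ANNOTATION))
```

Because tagging happens on the untouched sentence and normalization comes afterwards, Boeing and Company are never altered. Note that grouping by label alone would merge two different entities of the same type if they happen to be adjacent, so treat this as a starting point only.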