I am using NER in NLTK to find persons, locations, and organizations in sentences. I am able to produce the results like this:
[(u'Remaking', u'O'), (u'The', u'O'), (u'Republican', u'ORGANIZATION'), (u'Party', u'ORGANIZATION')]
Is it possible to chunk things together using it? What I want is something like this:
u'Remaking'/ u'O', u'The'/u'O', (u'Republican', u'Party')/u'ORGANIZATION'
Thanks!
It looks long but it does the work:
[out]:
For more details:
The first for-loop "with memory" achieves something like this:
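The original snippet did not survive here, so the following is only a minimal sketch of what such a loop "with memory" could look like. The variable names (ne_tagged_sent, chunked, prev_tag) are my own, and it uses Python 3 strings rather than the u'' literals from the question:

```python
# (word, tag) pairs as produced by the tagger in the question.
ne_tagged_sent = [('Remaking', 'O'), ('The', 'O'),
                  ('Republican', 'ORGANIZATION'), ('Party', 'ORGANIZATION')]

chunked = []      # tuples built so far
prev_tag = None   # the loop's "memory": the tag of the previous token

for word, tag in ne_tagged_sent:
    if tag != 'O' and tag == prev_tag:
        # Same entity tag as the previous token: extend the last tuple.
        chunked[-1] += (word, tag)
    else:
        # Untagged token or a new entity: start a new tuple.
        chunked.append((word, tag))
    prev_tag = tag

print(chunked)
# [('Remaking', 'O'), ('The', 'O'), ('Republican', 'ORGANIZATION', 'Party', 'ORGANIZATION')]
```

Tokens tagged 'O' stay as plain pairs, while a run of tokens sharing an entity tag ends up in one flat tuple of alternating words and tags.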
You'll notice that every named entity ends up with more than 2 items in its tuple, and what you want are the words, i.e. 'Republican Party' in (u'Republican', u'ORGANIZATION', u'Party', u'ORGANIZATION'), so you take the elements at the even indices and join them.
You'll also notice that the last element in the NE tuple is the tag you want, so you take the element at index -1 for that.
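As a small illustration of those two steps (a sketch with variable names of my own choosing), slicing with a step of 2 picks out the words, and index -1 picks out the tag:

```python
ne = ('Republican', 'ORGANIZATION', 'Party', 'ORGANIZATION')

words = ne[::2]            # items at even indices 0, 2, ... are the words
phrase = ' '.join(words)   # join them into one string
tag = ne[-1]               # the last item is the entity tag

print((phrase, tag))
# ('Republican Party', 'ORGANIZATION')
```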
It's a little ad hoc and verbose, but I hope it helps. And here it is in a function, Blessed Christmas:
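The function itself is missing from this copy of the answer. A sketch of how the pieces above could be wrapped up is below; the name rechunk and the outside_tag parameter are assumptions of mine, not necessarily what the answer used:

```python
def rechunk(ner_output, outside_tag='O'):
    """Merge consecutive tokens sharing an entity tag into (phrase, tag) pairs."""
    chunked = []
    prev_tag = None
    for word, tag in ner_output:
        if tag != outside_tag and tag == prev_tag:
            # Continue the current entity by extending the last tuple.
            chunked[-1] += (word, tag)
        else:
            chunked.append((word, tag))
        prev_tag = tag
    # Words sit at even indices; the tag is the last item of each tuple.
    return [(' '.join(item[::2]), item[-1]) for item in chunked]


ne_tagged_sent = [('Remaking', 'O'), ('The', 'O'),
                  ('Republican', 'ORGANIZATION'), ('Party', 'ORGANIZATION')]
print(rechunk(ne_tagged_sent))
# [('Remaking', 'O'), ('The', 'O'), ('Republican Party', 'ORGANIZATION')]
```

One caveat of this approach: two different entities of the same type that happen to be adjacent (with no 'O' token between them) would be merged into one chunk, because plain tags carry no begin/inside information.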
You can use the standard NLTK way of representing chunks using nltk.Tree. This might mean that you have to change your representation a bit.
What I usually do is represent NER-tagged sentences as lists of triplets:
I do this when I use an external tool for NER tagging a sentence. Now you can transform this sentence into the NLTK representation:
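The conversion code from this answer is not preserved here. As a rough sketch of one possible route, the triplets can first be rewritten with IOB prefixes using plain Python. The function name to_iob and the POS tags in the example sentence are my own assumptions for illustration:

```python
def to_iob(triplets, outside_tag='O'):
    """Turn (word, pos, ne_tag) triplets into (word, pos, iob_tag) triplets."""
    iob = []
    prev_tag = outside_tag
    for word, pos, tag in triplets:
        if tag == outside_tag:
            iob.append((word, pos, outside_tag))
        elif tag == prev_tag:
            # Same entity tag as the previous token: inside a chunk.
            iob.append((word, pos, 'I-' + tag))
        else:
            # First token of a new chunk.
            iob.append((word, pos, 'B-' + tag))
        prev_tag = tag
    return iob


# Illustrative sentence; the POS tags here are assumed, not tagger output.
sent = [('Remaking', 'VBG', 'O'), ('The', 'DT', 'O'),
        ('Republican', 'NNP', 'ORGANIZATION'), ('Party', 'NNP', 'ORGANIZATION')]
iob_sent = to_iob(sent)
print(iob_sent)
```

If NLTK is installed, passing iob_sent to nltk.chunk.conlltags2tree should then build an nltk.Tree with the entity as a subtree. I have left that call out of the sketch so that it runs without NLTK.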
The change in representation kind of makes sense because you certainly need POS tags for NER tagging.
The end result should look like:
This is actually coming in the next release of CoreNLP, under the name MentionsAnnotator. It likely won't be directly available from NLTK, though, unless the NLTK people wish to support it along with the standard Stanford NER interface.
In any case, for the moment you'll have to copy the code I've linked to (which uses LabeledChunkIdentifier for the dirty work) or write your own postprocessor in Python.
Here is another short implementation for grouping the Stanford NER results using the groupby iterator of itertools:
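The answer's own grouptags function is not reproduced in this copy, so here is a stand-in sketch of the same idea with itertools.groupby. The name group_consecutive_tags and its ignore and joiner parameters are hypothetical and may differ from the original options:

```python
from itertools import groupby


def group_consecutive_tags(tagged, ignore='O', joiner=' '):
    """Yield (text, tag) pairs, merging runs of tokens that share a tag.

    Tokens whose tag equals `ignore` are passed through one by one.
    """
    # groupby only groups *adjacent* items with the same key, which is
    # exactly what is needed for consecutive tokens of one entity.
    for tag, group in groupby(tagged, key=lambda pair: pair[1]):
        words = [word for word, _ in group]
        if tag == ignore:
            for word in words:
                yield (word, tag)
        else:
            yield (joiner.join(words), tag)


ne_tagged_sent = [('Remaking', 'O'), ('The', 'O'),
                  ('Republican', 'ORGANIZATION'), ('Party', 'ORGANIZATION')]
print(list(group_consecutive_tags(ne_tagged_sent)))
# [('Remaking', 'O'), ('The', 'O'), ('Republican Party', 'ORGANIZATION')]
```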
The function grouptags has two options: