I'm trying to chunk a sentence using ne_chunk and pos_tag in nltk.
from nltk import tag
from nltk.tag import pos_tag
from nltk.tree import Tree
from nltk.chunk import ne_chunk
sentence = "Michael and John is reading a booklet in a library of Jakarta"
tagged_sent = pos_tag(sentence.split())
print_chunk = [chunk for chunk in ne_chunk(tagged_sent) if isinstance(chunk, Tree)]
print(print_chunk)
and this is the result:
[Tree('GPE', [('Michael', 'NNP')]), Tree('PERSON', [('John', 'NNP')]), Tree('GPE', [('Jakarta', 'NNP')])]
My question: is it possible to leave out the POS tags (like NNP above) and keep only the Tree labels 'GPE' and 'PERSON'? Also, what does 'GPE' mean?
Thanks in advance
The named entity chunker will give you a tree containing both chunks and tags. You can't change that, but you can take the tags out. Starting from your tagged_sent:

If you only want the chunks, omit the else: clause in the above. You can adapt the code to wrap the chunks any way you want. I used an nltk Tree to keep the changes to a minimum. Note that some chunks consist of multiple words (try adding "New York" to your example), so the chunk's contents must be a list, not a single element.

PS. "GPE" stands for "geo-political entity" (so labelling "Michael" as one is obviously a chunker mistake). You can see a list of the "commonly used tags" in the NLTK book.
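One way to do what this answer describes, as a minimal sketch rather than the answer's original code. The function name strip_pos_tags is hypothetical. To stay free of extra dependencies it returns plain (label, words) tuples instead of rebuilding Tree objects, and it recognises a chunk by the presence of a label() method, which NLTK's Tree provides.

```python
def strip_pos_tags(chunked):
    """Drop POS tags from ne_chunk-style output.

    `chunked` is any iterable mixing (word, tag) tuples with chunk nodes
    that expose .label() and iterate over (word, tag) leaves, as NLTK's
    Tree does. (Hypothetical helper, not part of NLTK.)
    """
    result = []
    for node in chunked:
        if hasattr(node, "label"):
            # Named-entity chunk: keep the label, and keep the words as a
            # list because a chunk can span several tokens ("New York").
            result.append((node.label(), [word for word, _tag in node]))
        else:
            # Plain (word, tag) token outside any chunk: keep the word only.
            # Remove this branch to keep the chunks alone.
            result.append(node[0])
    return result
```

Called as strip_pos_tags(ne_chunk(tagged_sent)), this should turn a chunk such as Tree('PERSON', [('John', 'NNP')]) into ('PERSON', ['John']) and leave unchunked tokens as bare words.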
Most probably a slight modification to the code on https://stackoverflow.com/a/31838373/610569 with the tags is what you require.
Yes, simply traverse the Tree object =) See How to Traverse an NLTK Tree object?
GPE means "Geo-Political Entity". The GPE tag came from the ACE dataset.

There are two pre-trained NE chunkers available, see https://github.com/nltk/nltk/blob/develop/nltk/chunk/__init__.py#L164
There are 3 tag sets that are supported: https://github.com/nltk/nltk/blob/develop/nltk/sem/relextract.py#L31
For a detailed explanation, see NLTK relation extraction returns nothing