
ne_chunk without pos_tag in NLTK

Posted 2020-02-01 12:01

Question:

I'm trying to chunk a sentence using ne_chunk and pos_tag in nltk.

from nltk import tag
from nltk.tag import pos_tag
from nltk.tree import Tree
from nltk.chunk import ne_chunk

sentence = "Michael and John is reading a booklet in a library of Jakarta"
tagged_sent = pos_tag(sentence.split())

print_chunk = [chunk for chunk in ne_chunk(tagged_sent) if isinstance(chunk, Tree)]

print(print_chunk)

and this is the result:

[Tree('GPE', [('Michael', 'NNP')]), Tree('PERSON', [('John', 'NNP')]), Tree('GPE', [('Jakarta', 'NNP')])]

My question: is it possible to leave out the POS tags (like NNP above) and keep only the Tree labels 'GPE' and 'PERSON'? Also, what does 'GPE' mean?

Thanks in advance

Answer 1:

The named entity chunker will give you a tree containing both chunks and tags. You can't change that, but you can take the tags out. Starting from your tagged_sent:

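chunks = ne_chunk(tagged_sent)  # ne_chunk and Tree are imported in the question's code
simple = []
for elt in chunks:
    if isinstance(elt, Tree):
        # a named-entity chunk: keep its label, drop the POS tags
        simple.append(Tree(elt.label(), [word for word, tag in elt]))
    else:
        # an ordinary (word, tag) pair: keep only the word
        simple.append(elt[0])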

If you only want the chunks, omit the else: clause in the above. You can adapt the code to wrap the chunks any way you want. I used an nltk Tree to keep the changes to a minimum. Note that some chunks consist of multiple words (try adding "New York" to your example), so the chunk's contents must be a list, not a single element.
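The point about multi-word chunks can be seen in a small sketch. To keep it runnable without downloading the NLTK models, the tree below is built by hand from a hypothetical stand-in class, FakeTree, which mimics only the two things the loop relies on: iteration over (word, tag) pairs and a label() method. The function name strip_tags is also made up for this illustration. With real output from ne_chunk the same function should work unchanged, assuming NLTK 3's Tree.label().

```python
class FakeTree(list):
    """Hypothetical stand-in for nltk.tree.Tree: a list of children plus a label."""

    def __init__(self, label, children):
        super().__init__(children)
        self._label = label

    def label(self):
        return self._label


def strip_tags(chunks):
    """Return (label, [words]) for each chunk and the bare word for everything else."""
    simple = []
    for elt in chunks:
        if hasattr(elt, "label"):
            # a chunk subtree: keep the label, collect every word it contains
            simple.append((elt.label(), [word for word, tag in elt]))
        else:
            # a plain (word, tag) pair: keep only the word
            simple.append(elt[0])
    return simple


# Hand-built example shaped like ne_chunk output; this is not actual chunker output.
sentence_tree = FakeTree("S", [
    FakeTree("PERSON", [("John", "NNP")]),
    ("visited", "VBD"),
    FakeTree("GPE", [("New", "NNP"), ("York", "NNP")]),
])

result = strip_tags(sentence_tree)
print(result)
```

The "New York" chunk comes back as a two-word list, which is why each chunk's contents are kept as a list even when most chunks hold a single word.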

PS. "GPE" stands for "geo-political entity" (labeling "Michael" as a GPE is obviously a chunker mistake). You can see a list of the "commonly used tags" in the NLTK book.



Answer 2:

Most probably a slight modification to the code on https://stackoverflow.com/a/31838373/610569 with the tags is what you require.

is it possible not to include pos_tag (like NNP above) and only include Tree 'GPE','PERSON'?

Yes, simply traverse the Tree object =) See How to Traverse an NLTK Tree object?

>>> from nltk import Tree, pos_tag, ne_chunk
>>> sentence = "Michael and John is reading a booklet in a library of Jakarta"
>>> tagged_sent = ne_chunk(pos_tag(sentence.split()))
>>> tagged_sent
Tree('S', [Tree('GPE', [('Michael', 'NNP')]), ('and', 'CC'), Tree('PERSON', [('John', 'NNP')]), ('is', 'VBZ'), ('reading', 'VBG'), ('a', 'DT'), ('booklet', 'NN'), ('in', 'IN'), ('a', 'DT'), ('library', 'NN'), ('of', 'IN'), Tree('GPE', [('Jakarta', 'NNP')])])

>>> from nltk.sem.relextract import NE_CLASSES
>>> ace_tags = NE_CLASSES['ace']

>>> for node in tagged_sent:
...     if isinstance(node, Tree) and node.label() in ace_tags:
...         words, tags = zip(*node.leaves())
...         print(node.label() + '\t' + ' '.join(words))
... 
GPE Michael
PERSON  John
GPE Jakarta

What does 'GPE' mean?

GPE means "Geo-Political Entity"

  • The GPE tag came from the ACE dataset

  • There are two pre-trained NE chunkers available, see https://github.com/nltk/nltk/blob/develop/nltk/chunk/__init__.py#L164

  • There are 3 tag sets that are supported: https://github.com/nltk/nltk/blob/develop/nltk/sem/relextract.py#L31

  • For a detailed explanation, see NLTK relation extraction returns nothing