Question:
I used NLTK's ne_chunk to extract named entities from a text:
my_sent = "WASHINGTON -- In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement."
nltk.ne_chunk(my_sent, binary=True)
But I can't figure out how to save these entities to a list. For example:
print Entity_list
('WASHINGTON', 'New York', 'Loretta', 'Brooklyn', 'African')
Thanks.
Answer 1:
nltk.ne_chunk returns a nested nltk.tree.Tree object, so you have to traverse the Tree object to get to the NEs.
Take a look at Named Entity Recognition with Regular Expression: NLTK
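>>> from nltk import ne_chunk, pos_tag, word_tokenize
>>> from nltk.tree import Tree
>>>
>>> def get_continuous_chunks(text):
...     chunked = ne_chunk(pos_tag(word_tokenize(text)))
...     continuous_chunk = []
...     current_chunk = []
...     for i in chunked:
...         if isinstance(i, Tree):
...             # NE subtree: collect its words; adjacent subtrees are merged
...             current_chunk.append(" ".join([token for token, pos in i.leaves()]))
...         elif current_chunk:
...             # ordinary token after an entity: store the entity once
...             named_entity = " ".join(current_chunk)
...             if named_entity not in continuous_chunk:
...                 continuous_chunk.append(named_entity)
...             current_chunk = []
...     # flush an entity that ends the text
...     if current_chunk:
...         named_entity = " ".join(current_chunk)
...         if named_entity not in continuous_chunk:
...             continuous_chunk.append(named_entity)
...     return continuous_chunk
...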
>>> my_sent = "WASHINGTON -- In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement."
>>> get_continuous_chunks(my_sent)
['WASHINGTON', 'New York', 'Loretta E. Lynch', 'Brooklyn']
Answer 2:
You can also extract the label of each named entity in the text using this code:
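import nltk
# `sentence` is the text to analyze, e.g. my_sent from the question
for sent in nltk.sent_tokenize(sentence):
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
        # only NE subtrees have a label; plain tokens are (word, tag) tuples
        if hasattr(chunk, 'label'):
            print(chunk.label(), ' '.join(c[0] for c in chunk))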
Output:
GPE WASHINGTON
GPE New York
PERSON Loretta E. Lynch
GPE Brooklyn
You can see that Washington, New York and Brooklyn are labeled GPE, which means geo-political entity, and Loretta E. Lynch is labeled PERSON.
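If you want to keep the labels instead of printing them, one option is to collect the entities into a dict keyed by label. This is only a minimal sketch: the tree is built by hand to mirror the output above (so it runs without downloading any NLTK models), and the names chunked and entities_by_label are illustrative assumptions, not part of the answer's code.

```python
from collections import defaultdict

from nltk.tree import Tree

# Hand-built stand-in for what ne_chunk returns (assumed shape, not real model output).
chunked = Tree('S', [
    Tree('GPE', [('WASHINGTON', 'NNP')]),
    ('by', 'IN'),
    Tree('GPE', [('New', 'NNP'), ('York', 'NNP')]),
    (',', ','),
    Tree('PERSON', [('Loretta', 'NNP'), ('E.', 'NNP'), ('Lynch', 'NNP')]),
    ('in', 'IN'),
    Tree('GPE', [('Brooklyn', 'NNP')]),
])

# Group the text of each NE subtree under its label.
entities_by_label = defaultdict(list)
for chunk in chunked:
    if isinstance(chunk, Tree):  # plain tokens are (word, tag) tuples, not Trees
        text = ' '.join(word for word, tag in chunk.leaves())
        entities_by_label[chunk.label()].append(text)

print(dict(entities_by_label))
```

With a real sentence you would replace the hand-built tree with the result of nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))).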
Answer 3:
As you get a tree as a return value, I guess you want to pick those subtrees that are labeled with NE. Here is a simple example to gather all those in a list:
import nltk
my_sent = "WASHINGTON -- In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement."
parse_tree = nltk.ne_chunk(nltk.tag.pos_tag(my_sent.split()), binary=True) # POS tagging before chunking!
named_entities = []
for t in parse_tree.subtrees():
    if t.label() == 'NE':
        named_entities.append(t)
        # named_entities.append(list(t))  # if you want to save a list of tagged words instead of a tree
print(named_entities)
This gives:
[Tree('NE', [('WASHINGTON', 'NNP')]), Tree('NE', [('New', 'NNP'), ('York', 'NNP')])]
or as a list of lists:
[[('WASHINGTON', 'NNP')], [('New', 'NNP'), ('York', 'NNP')]]
Also see: How to navigate a nltk.tree.Tree?
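The list above holds Tree objects, while the question asks for plain strings. A minimal sketch of that last step, assuming the same binary=True tree shape: the tree here is hand-built so the snippet runs without NLTK model downloads, and entity_list is just an illustrative name.

```python
from nltk.tree import Tree

# Hand-built stand-in for the tree that ne_chunk(..., binary=True) returns.
parse_tree = Tree('S', [
    Tree('NE', [('WASHINGTON', 'NNP')]),
    ('--', ':'),
    ('abuses', 'NNS'),
    ('by', 'IN'),
    Tree('NE', [('New', 'NNP'), ('York', 'NNP')]),
    ('police', 'NN'),
])

# Join the words of every NE subtree into a single string.
entity_list = [
    ' '.join(word for word, tag in subtree.leaves())
    for subtree in parse_tree.subtrees(filter=lambda t: t.label() == 'NE')
]
print(entity_list)  # ['WASHINGTON', 'New York']
```

subtrees() accepts a filter callable, which saves the explicit if inside the loop.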
Answer 4:
Use tree2conlltags from nltk.chunk. Note that ne_chunk needs POS-tagged input, and the POS tagger works on word tokens (hence word_tokenize).
from nltk import word_tokenize, pos_tag, ne_chunk
from nltk.chunk import tree2conlltags
sentence = "Mark and John are working at Google."
print(tree2conlltags(ne_chunk(pos_tag(word_tokenize(sentence)))))
"""[('Mark', 'NNP', 'B-PERSON'),
('and', 'CC', 'O'), ('John', 'NNP', 'B-PERSON'),
('are', 'VBP', 'O'), ('working', 'VBG', 'O'),
('at', 'IN', 'O'), ('Google', 'NNP', 'B-ORGANIZATION'),
('.', '.', 'O')] """
This will give you a list of tuples: [(token, pos_tag, name_entity_tag)]
If this list is not exactly what you want, it is certainly easier to build the list you want from this one than from an nltk tree.
Code and details from this link; check it out for more information
You can also continue by only extracting the words, with the following function:
def wordextractor(tagged):
    # tagged is the list of (token, pos_tag, ne_tag) triples from tree2conlltags;
    # keep only the tokens whose NE tag is B-PERSON or I-PERSON
    return [word for word, pos, ne in tagged if ne in ('B-PERSON', 'I-PERSON')]
print(wordextractor(tree2conlltags(ne_chunk(pos_tag(word_tokenize(sentence))))))
Edit: Added output docstring.
Edit: Added output only for B-PERSON.
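wordextractor returns single tokens, so a multi-word name comes back as separate words. If you want whole entities with their labels, the B-/I- prefixes can be used to merge consecutive tokens. The following is only a sketch in plain Python: group_entities is a hypothetical helper (not part of NLTK), the first sample is the output shown above, and the second is a hand-written illustration rather than real tagger output.

```python
def group_entities(conll_tags):
    """Merge B-/I- tagged tokens into (label, text) pairs."""
    entities = []
    current_words, current_label = [], None
    for word, pos, ne_tag in conll_tags:
        prefix, _, label = ne_tag.partition('-')
        if prefix == 'I' and label == current_label:
            # continuation of the entity currently being built
            current_words.append(word)
            continue
        if current_words:
            # any other tag closes the entity collected so far
            entities.append((current_label, ' '.join(current_words)))
        if prefix in ('B', 'I'):
            current_words, current_label = [word], label
        else:  # 'O': outside any entity
            current_words, current_label = [], None
    if current_words:
        entities.append((current_label, ' '.join(current_words)))
    return entities


sample = [('Mark', 'NNP', 'B-PERSON'), ('and', 'CC', 'O'),
          ('John', 'NNP', 'B-PERSON'), ('are', 'VBP', 'O'),
          ('working', 'VBG', 'O'), ('at', 'IN', 'O'),
          ('Google', 'NNP', 'B-ORGANIZATION'), ('.', '.', 'O')]
print(group_entities(sample))
# [('PERSON', 'Mark'), ('PERSON', 'John'), ('ORGANIZATION', 'Google')]

# Hand-written multi-token example (illustrative tags, not tagger output).
multi = [('Loretta', 'NNP', 'B-PERSON'), ('E.', 'NNP', 'I-PERSON'),
         ('Lynch', 'NNP', 'I-PERSON'), ('spoke', 'VBD', 'O')]
print(group_entities(multi))
# [('PERSON', 'Loretta E. Lynch')]
```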
Answer 5:
A Tree is a list. Chunks are subtrees; non-chunked words are plain (word, tag) tuples. So let's go down the list, extract the words from each chunk, and join them.
>>> chunked = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(my_sent)))
>>>
>>> [ " ".join(w for w, t in elt) for elt in chunked if isinstance(elt, nltk.Tree) ]
['WASHINGTON', 'New York', 'Loretta E. Lynch', 'Brooklyn']
Answer 6:
You may also consider using spaCy:
import spacy
nlp = spacy.load('en_core_web_sm')  # older spaCy versions also accepted the shortcut 'en'
doc = nlp('WASHINGTON -- In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement.')
print([ent for ent in doc.ents])
Output:
[WASHINGTON, New York, the 1990s, Loretta E. Lynch, Brooklyn, African-Americans]