Absolute position of leaves in NLTK tree

I am trying to find the span (start index, end index) of a noun phrase in a given sentence. The following is the code for extracting noun phrases

sent=nltk.word_tokenize(a)
sent_pos=nltk.pos_tag(sent)
grammar = r"""
    NBAR:
        {<NN.*|JJ>*<NN.*>}  # Nouns and Adjectives, terminated with Nouns

    NP:
        {<NBAR>}
        {<NBAR><IN><NBAR>}  # Above, connected with in/of/etc...
    VP:
        {<VBD><PP>?}
        {<VBZ><PP>?}
        {<VB><PP>?}
        {<VBN><PP>?}
        {<VBG><PP>?}
        {<VBP><PP>?}
"""

cp = nltk.RegexpParser(grammar)
result = cp.parse(sent_pos)
nounPhrases = []
for subtree in result.subtrees(filter=lambda t: t.label() == 'NP'):
  np = ''
  for x in subtree.leaves():
    np = np + ' ' + x[0]
  nounPhrases.append(np.strip())

For a = "The American Civil War, also known as the War between the States or simply the Civil War, was a civil war fought from 1861 to 1865 in the United States after several Southern slave states declared their secession and formed the Confederate States of America.", the noun phrases extracted are

['American Civil War', 'War', 'States', 'Civil War', 'civil war fought', 'United States', 'several Southern', 'states', 'secession', 'Confederate States', 'America'].

Now I need to find the span (start position and end position of the phrase) of noun phrases. For example, the span of above noun phrases will be

[(1,3), (9,9), (12, 12), (16, 17), (21, 23), ....].

I'm fairly new to NLTK and I've looked into http://www.nltk.org/_modules/nltk/tree.html. I tried to use Tree.treepositions() but I couldn't manage to extract absolute positions using these indices. Any help would be greatly appreciated. Thank You!

标签： python tree nlp nltk chunking

2条回答

兄弟一词,经得起流年.

2楼-- · 2019-04-10 12:44

Here is another approach that augments the tokens with their absolute positions in a tree string. The absolute positions can now be extracted from the leaves of any subtree.

def add_indices_to_terminals(treestring):
    tree = ParentedTree.fromstring(treestring)
    for idx, _ in enumerate(tree.leaves()):
        tree_location = tree.leaf_treeposition(idx)
        non_terminal = tree[tree_location[:-1]]
        non_terminal[0] = non_terminal[0] + "_" + str(idx)
    return str(tree)

Example use case

>>> treestring = (S (NP (NNP John)) (VP (V runs)))
>>> add_indices_to_terminals(treestring)
(S (NP (NNP John_0)) (VP (V runs_1)))

0人赞添加讨论(0) 举报

一夜七次

3楼-- · 2019-04-10 12:49

There isn't any implicit function that returns the offsets of strings/tokens as highlighted by https://github.com/nltk/nltk/issues/1214

But you can use an ngram searcher that is used by the RIBES score from https://github.com/nltk/nltk/blob/develop/nltk/translate/ribes_score.py#L123

>>> from nltk import word_tokenize
>>> from nltk.translate.ribes_score import position_of_ngram
>>> s = word_tokenize("The American Civil War, also known as the War between the States or simply the Civil War, was a civil war fought from 1861 to 1865 in the United States after several Southern slave states declared their secession and formed the Confederate States of America.")
>>> position_of_ngram(tuple('American Civil War'.split()), s)
1
>>> position_of_ngram(tuple('Confederate States of America'.split()), s)
43

(It returns the starting position of the query ngram)

0人赞添加讨论(0) 举报

Absolute position of leaves in NLTK tree

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间