generally A head of a nounphrase is a noun which is rightmost of the NP as shown below tree is the head of the parent NP. So
ROOT | S ___|________________________ NP | ___|_____________ | | PP VP | ____|____ ____|___ NP | NP | PRT ___|_______ | | | | DT JJ NN NN IN NNP VBD RP | | | | | | | | The old oak tree from India fell down
Out[40]: Tree('S', [Tree('NP', [Tree('NP', [Tree('DT', ['The']), Tree('JJ', ['old']), Tree('NN', ['oak']), Tree('NN', ['tree'])]), Tree('PP', [Tree('IN', ['from']), Tree('NP', [Tree('NNP', ['India'])])])]), Tree('VP', [Tree('VBD', ['fell']), Tree('PRT', [Tree('RP', ['down'])])])])
The following code based on a java implementation uses a simplistic rule to find the head of the NP , but i need to be based on the rules:
parsestr='(ROOT (S (NP (NP (DT The) (JJ old) (NN oak) (NN tree)) (PP (IN from) (NP (NNP India)))) (VP (VBD fell) (PRT (RP down)))))'
def traverse(t):
try:
t.label()
except AttributeError:
return
else:
if t.label()=='NP':
print 'NP:'+str(t.leaves())
print 'NPhead:'+str(t.leaves()[-1])
for child in t:
traverse(child)
else:
for child in t:
traverse(child)
tree=Tree.fromstring(parsestr)
traverse(tree)
The above code gives output:
NP:['The', 'old', 'oak', 'tree', 'from', 'India'] NPhead:India NP:['The', 'old', 'oak', 'tree'] NPhead:tree NP:['India'] NPhead:India
Although now its giving correct output for the sentence given but I need to incorporate a condition that only right most noun is extracted as head , currently it does not check if it were a noun (NN)
print 'NPhead:'+str(t.leaves()[-1])
So something like following in the np head condition in above code:
t.leaves().getrightmostnoun()
Michael Collins dissertation (Appendix A) includes head-finding rules for the Penn Treebank, and hence it is not necessary that only the rightmost noun is the head. Hence the above conditions should incorporate such scenario.
For the following example as given in one of the answers:
(NP (NP the person) that gave (NP the talk)) went home
The head noun of the subject is person but the last leave node of the NP the person that gave the talk is talk.
I was looking for a python script using NLTK that does this task and stumbled across this post. Here's the solution I came up with. It's a little bit noisy and arbitrary, and definitely doesn't always pick the right answer (e.g. for compound nouns). But I wanted to post it in case it was helpful for others to have a solution that mostly works.
For the examples discussed in the question and in the other answers, this is the output:
There are built-in string to
Tree
object in NLTK (http://www.nltk.org/_modules/nltk/tree.html), see https://github.com/nltk/nltk/blob/develop/nltk/tree.py#L541.Note that it's not always the case that right most noun is the head noun of an NP, e.g.
Arguably,
Magnificent
can still be the head noun. Another example is when the NP includes a relative clause:The head noun of the subject is
person
but the last leave node of the NPthe person that gave the talk
istalk
.