Is there any way I can get the Universal Dependencies using Python or NLTK? I can only produce the parse tree.
Example:
Input sentence:
My dog also likes eating sausage.
Output:
Universal dependencies
nmod:poss(dog-2, My-1)
nsubj(likes-4, dog-2)
advmod(likes-4, also-3)
root(ROOT-0, likes-4)
xcomp(likes-4, eating-5)
dobj(eating-5, sausage-6)
Wordseer's stanford-corenlp-python fork is a good start, as it works with the recent CoreNLP release (3.5.2). However, it gives you raw output, which you need to transform manually. For example, once you have the wrapper running:
>>> import json, jsonrpclib
>>> from pprint import pprint
>>>
>>> server = jsonrpclib.Server("http://localhost:8080")
>>>
>>> pprint(json.loads(server.parse('John loves Mary.'))) # doctest: +SKIP
{u'sentences': [{u'dependencies': [[u'root', u'ROOT', u'0', u'loves', u'2'],
                                   [u'nsubj',
                                    u'loves',
                                    u'2',
                                    u'John',
                                    u'1'],
                                   [u'dobj', u'loves', u'2', u'Mary', u'3'],
                                   [u'punct', u'loves', u'2', u'.', u'4']],
                 u'parsetree': [],
                 u'text': u'John loves Mary.',
                 u'words': [[u'John',
                             {u'CharacterOffsetBegin': u'0',
                              u'CharacterOffsetEnd': u'4',
                              u'Lemma': u'John',
                              u'PartOfSpeech': u'NNP'}],
                            [u'loves',
                             {u'CharacterOffsetBegin': u'5',
                              u'CharacterOffsetEnd': u'10',
                              u'Lemma': u'love',
                              u'PartOfSpeech': u'VBZ'}],
                            [u'Mary',
                             {u'CharacterOffsetBegin': u'11',
                              u'CharacterOffsetEnd': u'15',
                              u'Lemma': u'Mary',
                              u'PartOfSpeech': u'NNP'}],
                            [u'.',
                             {u'CharacterOffsetBegin': u'15',
                              u'CharacterOffsetEnd': u'16',
                              u'Lemma': u'.',
                              u'PartOfSpeech': u'.'}]]}]}
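If all you need is the `rel(head-i, dep-j)` notation shown in the question, a small helper can reshape those raw five-element dependency lists. (The name `format_dependencies` is mine, not part of the wrapper; the sample data is the `dependencies` list from the output above.)

```python
def format_dependencies(deps):
    # Each raw entry is [relation, head word, head index, dependent word, dependent index].
    return ['{}({}-{}, {}-{})'.format(rel, head, h, dep, d)
            for rel, head, h, dep, d in deps]

# Sample taken from the wrapper's output for 'John loves Mary.' above.
deps = [
    ['root', 'ROOT', '0', 'loves', '2'],
    ['nsubj', 'loves', '2', 'John', '1'],
    ['dobj', 'loves', '2', 'Mary', '3'],
    ['punct', 'loves', '2', '.', '4'],
]

for line in format_dependencies(deps):
    print(line)
# root(ROOT-0, loves-2)
# nsubj(loves-4, dog-2)-style lines, one per dependency
```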
If you want to work with the dependency parse programmatically, you can reuse NLTK's DependencyGraph with a bit of effort:
>>> import jsonrpclib, json
>>> from pprint import pprint
>>> from nltk.parse import DependencyGraph
>>>
>>> server = jsonrpclib.Server("http://localhost:8080")
>>> parses = json.loads(
...     server.parse(
...         'John loves Mary. '
...         'I saw a man with a telescope. '
...         'Ballmer has been vocal in the past warning that Linux is a threat to Microsoft.'
...     )
... )['sentences']
>>>
>>> def transform(sentence):
...     for rel, _, head, word, n in sentence['dependencies']:
...         n = int(n)
...
...         word_info = sentence['words'][n - 1][1]
...         tag = word_info['PartOfSpeech']
...         lemma = word_info['Lemma']
...         if rel == 'root':
...             # NLTK expects that the root relation is labelled as ROOT!
...             rel = 'ROOT'
...
...         # Hack: Return values we don't know as '_'.
...         # Also, consider tag and ctag to be equal.
...         # n is used to sort words as they appear in the sentence.
...         yield n, '_', word, lemma, tag, tag, '_', head, rel, '_', '_'
...
>>> dgs = [
...     DependencyGraph(
...         ' '.join(items)  # NLTK expects an iterable of strings...
...         for n, *items in sorted(transform(parse))
...     )
...     for parse in parses
... ]
>>>
>>> # Play around with the information we've got.
>>>
>>> pprint(list(dgs[0].triples()))
[(('loves', 'VBZ'), 'nsubj', ('John', 'NNP')),
(('loves', 'VBZ'), 'dobj', ('Mary', 'NNP')),
(('loves', 'VBZ'), 'punct', ('.', '.'))]
>>>
>>> print(dgs[1].tree())
(saw I (man a (with (telescope a))) .)
>>>
>>> print(dgs[2].to_conll(4)) # doctest: +NORMALIZE_WHITESPACE
Ballmer NNP 4 nsubj
has VBZ 4 aux
been VBN 4 cop
vocal JJ 0 ROOT
in IN 4 prep
the DT 8 det
past JJ 8 amod
warning NN 5 pobj
that WDT 13 dobj
Linux NNP 13 nsubj
is VBZ 13 cop
a DT 13 det
threat NN 8 rcmod
to TO 13 prep
Microsoft NNP 14 pobj
. . 4 punct
<BLANKLINE>
Setting up CoreNLP is not that hard; see http://www.eecs.qmul.ac.uk/~dm303/stanford-dependency-parser-nltk-and-anaconda.html for more details.