Using Stanford Tregex in Python

Posted 2019-01-26 21:20

Question:

I'm new to NLP and Python. I'm trying to extract a subset of noun phrases from parse trees produced by StanfordCoreNLP, using the Tregex tool and Python's subprocess library. In particular, I want to find and extract the noun phrases that match the following Tregex pattern: '(NP[$VP]>S)|(NP[$VP]>S\n)|(NP\n[$VP]>S)|(NP\n[$VP]>S\n)'.

For example, below is the original text, saved in a string named "text":

text = ('Pusheen and Smitha walked along the beach. "I want to surf", said Smitha, the CEO of Tesla. However, she fell off the surfboard')

After running the StanfordCoreNLP parser using the Python wrapper, I got the following 3 trees for the 3 sentences:

output1['sentences'][0]['parse']

Out[58]: '(ROOT\n  (S\n    (NP (NNP Pusheen)\n      (CC and)\n      (NNP Smitha))\n    (VP (VBD walked)\n      (PP (IN along)\n        (NP (DT the) (NN beach))))\n    (. .)))'

output1['sentences'][1]['parse']

Out[59]: "(ROOT\n  (SINV (`` ``)\n    (S\n      (NP (PRP I))\n      (VP (VBP want)\n        (PP (TO to)\n          (NP (NN surf) ('' '')))))\n    (, ,)\n    (VP (VBD said))\n    (NP\n      (NP (NNP Smitha))\n      (, ,)\n      (NP\n        (NP (DT the) (NNP CEO))\n        (PP (IN of)\n          (NP (NNP Tesla)))))\n    (. .)))"

output1['sentences'][2]['parse']

Out[60]: '(ROOT\n  (S\n    (ADVP (RB However))\n    (, ,)\n    (NP (PRP she))\n    (VP (VBD fell)\n      (PRT (RP off))\n      (NP (DT the) (NN surfboard)))))'

I would like to extract the following 3 noun phrases (one for each sentence) and save them as variables (or lists of tokens) in Python:

  • (NP (NNP Pusheen) \n (CC and) \n (NNP Smitha))
  • (NP (PRP I))
  • (NP (PRP she))

For your information, I have used Tregex from the command line with the following commands:

cd stanford-tregex-2016-10-31
java -cp 'stanford-tregex.jar:' edu.stanford.nlp.trees.tregex.TregexPattern -f -s '(NP[$VP]>S)|(NP[$VP]>S\n)|(NP\n[$VP]>S)|(NP\n[$VP]>S\n)' /Users/AS/stanford-tregex-2016-10-31/exampletree.txt

The output was:

Pattern string:
(NP[$VP]>S)|(NP[$VP]>S\n)|(NP\n[$VP]>S)|(NP\n[$VP]>S\n)
Parsed representation:
or
   Root NP
      and
         $ VP
         > S
   Root NP
      and
         $ VP
         > S\n
   Root NP\n
      and
         $ VP
         > S
   Root NP\n
      and
         $ VP
         > S\n
Reading trees from file(s) file path
# /Users/AS/stanford-tregex-2016-10-31/exampletree.txt
(NP (NNP Pusheen) \n (CC and) \n (NNP Smitha))
# /Users/AS/stanford-tregex-2016-10-31/exampletree.txt
(NP\n (NP (NNP Smitha)) \n (, ,) \n (NP\n (NP (DT the) (NN spokesperson)) \n   (PP (IN of) \n (NP (DT the) (NNP CIA)))) \n (, ,))
# /Users/AS/stanford-tregex-2016-10-31/exampletree.txt
(NP (PRP They))
There were 3 matches in total.

How can I replicate this result in Python?

For your reference, I found the following post via Google, which is relevant to my question but outdated (https://mailman.stanford.edu/pipermail/parser-user/2010-July/000606.html):

[parser-user] Variable input to Tregex

Christopher Manning manning at stanford.edu
Wed Jul 7 17:41:32 PDT 2010

Hi Haiyang,

Sorry, slow reply, things are too busy at the end of the academic year.

On Jun 1, 2010, at 8:56 PM, Haiyang AI wrote:

Dear All,

I hope this is the right place to seek help.

It is, though we can only give very limited help on anything Python specific.....

But this seems to be straightforward (I think).

If what you're wanting is for the pattern to be run on trees being fed in over stdin, you need to add the flag "-filter" in the argument list prior to "NP".

If no file is specified after the pattern, and the flag "-filter" is not given, then it runs the pattern on a fixed default sentence....

Chris.

I'm working on a project related to Tregex. I'm trying to call Tregex from python, but I don't know how to feed data into Tregex, not from conventional file, but from a variable. For example, I'm trying to count the number of "NP" from a given variable (e.g. text, already parsed tree, using Stanford Parser), with the following code,

def tregex(text):
    tregex_dir = "/root/nlp/stanford-tregex-2009-08-30/"
    op = Popen(["java", "-mx900m", "-cp", "stanford-tregex.jar:",
                "edu.stanford.nlp.trees.tregex.TregexPattern", "NP"],
               cwd=tregex_dir, stdout=PIPE, stdin=PIPE, stderr=STDOUT)
    res = op.communicate(input=text)[0]
    return res

The results are shown below. Instead of searching the content of the variable, Tregex somehow fell back to "using default tree". Can anyone give me a hand? I have been stuck here for quite a long time. I really appreciate your time and help.

Pattern string:
NP
Parsed representation:
Root NP
using default tree
(NP (NP (DT this) (NN wine)) (CC and) (NP (DT these) (NNS snails)))

(NP (DT this) (NN wine))

(NP (DT these) (NNS snails))

There were 3 matches in total.

-- Haiyang AI, Ph.D. student Department of Applied Linguistics The Pennsylvania State University
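
Following the advice in the quoted reply, a minimal Python 3 sketch of the subprocess approach might look like the code below. It places "-filter" before the pattern so that Tregex reads trees from stdin, and reuses the "-s" flag from the command line shown earlier so that each match is printed on one line. The helper names (build_tregex_command, parse_matches, run_tregex) are hypothetical, the classpath and "-mx900m" setting are copied from the quoted code, and the sketch assumes that every match line starts with "(" while diagnostic lines do not. It has not been checked against every Tregex release.

```python
import subprocess


def build_tregex_command(pattern, memory="900m"):
    """Build the java argument list; -filter must come before the pattern."""
    return [
        "java", "-mx" + memory,
        "-cp", "stanford-tregex.jar:",  # POSIX-style classpath, as in the quoted code
        "edu.stanford.nlp.trees.tregex.TregexPattern",
        "-filter",  # read trees from stdin instead of a file
        "-s",       # print each matching subtree on a single line
        pattern,
    ]


def parse_matches(output):
    """Keep only the lines that look like bracketed trees (assumed to be matches)."""
    stripped = (line.strip() for line in output.splitlines())
    return [line for line in stripped if line.startswith("(")]


def run_tregex(pattern, trees, tregex_dir):
    """Feed parse-tree strings to Tregex over stdin and return the matches.

    tregex_dir is the directory that contains stanford-tregex.jar.
    Requires Python 3.7+ for the text=True argument.
    """
    proc = subprocess.run(
        build_tregex_command(pattern),
        input="\n".join(trees),
        cwd=tregex_dir,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,  # discard diagnostics in this sketch
        text=True,
        check=True,
    )
    return parse_matches(proc.stdout)
```

With the three parse strings from above collected in a list, a call such as run_tregex("NP[$VP]>S", trees, "/Users/AS/stanford-tregex-2016-10-31") would return the matching subtrees as Python strings. Because the wrapper's parse strings contain real newlines rather than a literal backslash followed by "n", the alternatives ending in \n are probably unnecessary here.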

Answer 1:

Why not use the Stanford CoreNLP server?

1.) Start up the server!

java -Xmx4g edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000

2.) Make a python request!

import requests

url = "http://localhost:9000/tregex"
request_params = {"pattern": "(NP[$VP]>S)|(NP[$VP]>S\\n)|(NP\\n[$VP]>S)|(NP\\n[$VP]>S\\n)"}
text = "Pusheen and Smitha walked along the beach."
r = requests.post(url, data=text, params=request_params)
print(r.json())

3.) Here are the results!

{u'sentences': [{u'0': {u'namedNodes': [], u'match': u'(NP (NNP Pusheen)\n  (CC and)\n  (NNP Smitha))\n'}}]}
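
To get from that JSON to the variables or token lists the question asks for, something like the following sketch could work. The function names extract_matches and leaf_tokens are hypothetical. The code assumes the response always has the shape shown above: a "sentences" list whose items map numeric string keys such as "0" to a dictionary with a "match" entry. Only the key "0" appears in the sample output, so the ordering of further keys is an assumption.

```python
import re


def extract_matches(response_json):
    """Return one list of match strings per sentence in a /tregex response."""
    results = []
    for sentence in response_json.get("sentences", []):
        # Keys are assumed to be numeric strings ("0", "1", ...); sort them numerically.
        ordered_keys = sorted(sentence, key=int)
        results.append([sentence[key]["match"].strip() for key in ordered_keys])
    return results


# A leaf looks like "(TAG word)": two tokens with no nested parentheses.
LEAF = re.compile(r"\(([^\s()]+)\s+([^\s()]+)\)")


def leaf_tokens(tree_string):
    """Return the words at the leaves of a bracketed tree string."""
    return [word for _tag, word in LEAF.findall(tree_string)]


# The response printed above, written as a plain Python 3 dictionary.
response = {
    "sentences": [
        {"0": {"namedNodes": [],
               "match": "(NP (NNP Pusheen)\n  (CC and)\n  (NNP Smitha))\n"}}
    ]
}

matches = extract_matches(response)
tokens = leaf_tokens(matches[0][0])
print(tokens)  # ['Pusheen', 'and', 'Smitha']
```

In the request example above, passing r.json() to extract_matches in place of the hard-coded dictionary should give the same result for that sentence.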