I'm new to NLP and Python. I'm trying to extract a subset of noun phrases from the parse trees produced by Stanford CoreNLP, using the Tregex tool and Python's subprocess module. In particular, I want to find and extract noun phrases that match the following Tregex pattern: '(NP[$VP]>S)|(NP[$VP]>S\n)|(NP\n[$VP]>S)|(NP\n[$VP]>S\n)'.
For example, below is the original text, saved in a string named "text":
text = ('Pusheen and Smitha walked along the beach. "I want to surf", said Smitha, the CEO of Tesla. However, she fell off the surfboard')
After running the StanfordCoreNLP parser using the Python wrapper, I got the following 3 trees for the 3 sentences:
output1['sentences'][0]['parse']
Out[58]: '(ROOT\n (S\n (NP (NNP Pusheen)\n (CC and)\n (NNP Smitha))\n (VP (VBD walked)\n (PP (IN along)\n (NP (DT the) (NN beach))))\n (. .)))'
output1['sentences'][1]['parse']
Out[59]: "(ROOT\n (SINV (`` ``)\n (S\n (NP (PRP I))\n (VP (VBP want)\n (PP (TO to)\n (NP (NN surf) ('' '')))))\n (, ,)\n (VP (VBD said))\n (NP\n (NP (NNP Smitha))\n (, ,)\n (NP\n (NP (DT the) (NNP CEO))\n (PP (IN of)\n (NP (NNP Tesla)))))\n (. .)))"
output1['sentences'][2]['parse']
Out[60]: '(ROOT\n (S\n (ADVP (RB However))\n (, ,)\n (NP (PRP she))\n (VP (VBD fell)\n (PRT (RP off))\n (NP (DT the) (NN surfboard)))))'
I would like to extract the following 3 noun phrases (one for each sentence) and save them as variables (or lists of tokens) in Python:
- (NP (NNP Pusheen) \n (CC and) \n (NNP Smitha))
- (NP (PRP I))
- (NP (PRP she))
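To make the target format concrete, here is a minimal sketch of turning one matched bracketed noun phrase into a list of tokens. The helper name `np_tokens` and the regular expression are illustrative assumptions of mine, not part of CoreNLP or Tregex, and the regex only handles simple leaves of the form (TAG word).

```python
import re

# Matches a leaf such as "(NNP Pusheen)": a tag, whitespace, a word,
# with no nested brackets inside. This is an illustrative assumption
# about the tree format, not an official parser.
LEAF = re.compile(r"\(([^\s()]+)\s+([^\s()]+)\)")


def np_tokens(np_tree):
    """Return the words of a bracketed noun phrase, left to right."""
    return [word for _tag, word in LEAF.findall(np_tree)]


subject = "(NP (NNP Pusheen)\n (CC and)\n (NNP Smitha))"
print(np_tokens(subject))  # ['Pusheen', 'and', 'Smitha']
```

The same helper applied to the other two targets would give ['I'] and ['she'].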
For your information, I have already run Tregex from the command line with the following commands:
cd stanford-tregex-2016-10-31
java -cp 'stanford-tregex.jar:' edu.stanford.nlp.trees.tregex.TregexPattern -f -s '(NP[$VP]>S)|(NP[$VP]>S\n)|(NP\n[$VP]>S)|(NP\n[$VP]>S\n)' /Users/AS/stanford-tregex-2016-10-31/exampletree.txt
The output was:
Pattern string:
(NP[$VP]>S)|(NP[$VP]>S\n)|(NP\n[$VP]>S)|(NP\n[$VP]>S\n)
Parsed representation:
or
Root NP
and
$ VP
> S
Root NP
and
$ VP
> S\n
Root NP\n
and
$ VP
> S
Root NP\n
and
$ VP
> S\n
Reading trees from file(s) file path
# /Users/AS/stanford-tregex-2016-10-31/exampletree.txt
(NP (NNP Pusheen) \n (CC and) \n (NNP Smitha))
# /Users/AS/stanford-tregex-2016-10-31/exampletree.txt
(NP\n (NP (NNP Smitha)) \n (, ,) \n (NP\n (NP (DT the) (NN spokesperson)) \n (PP (IN of) \n (NP (DT the) (NNP CIA)))) \n (, ,))
# /Users/AS/stanford-tregex-2016-10-31/exampletree.txt
(NP (PRP They))
There were 3 matches in total.
How can I replicate this result in Python?
For your reference, I found the following post via Google, which is relevant to my question but outdated (https://mailman.stanford.edu/pipermail/parser-user/2010-July/000606.html):
[parser-user] Variable input to Tregex
Christopher Manning manning at stanford.edu
Wed Jul 7 17:41:32 PDT 2010

Hi Haiyang,
Sorry, slow reply, things are too busy at the end of the academic year.
On Jun 1, 2010, at 8:56 PM, Haiyang AI wrote:
Dear All,
I hope this is the right place to seek help.
It is, though we can only give very limited help on anything Python specific.....
But this seems to be straightforward (I think).
If what you're wanting is for the pattern to be run on trees being fed in over stdin, you need to add the flag "-filter" in the argument list prior to "NP".
If no file is specified after the pattern, and the flag "-filter" is not given, then it runs the pattern on a fixed default sentence....
Chris.
I'm working on a project related to Tregex. I'm trying to call Tregex from python, but I don't know how to feed data into Tregex, not from conventional file, but from a variable. For example, I'm trying to count the number of "NP" from a given variable (e.g. text, already parsed tree, using Stanford Parser), with the following code,
def tregex(text):
    tregex_dir = "/root/nlp/stanford-tregex-2009-08-30/"
    op = Popen(["java", "-mx900m", "-cp", "stanford-tregex.jar:",
                "edu.stanford.nlp.trees.tregex.TregexPattern", "NP"],
               cwd = tregex_dir, stdout = PIPE, stdin = PIPE, stderr = STDOUT)
    res = op.communicate(input=text)[0]
    return res

The results are like the following. It didn't search the content from the variable, but somehow falling back to "using default tree". Can anyone give me a hand? I have been stuck here for quite a long time. Really appreciate your time and help.

Pattern string:
NP
Parsed representation:
Root NP
using default tree
(NP (NP (DT this) (NN wine)) (CC and) (NP (DT these) (NNS snails)))
(NP (DT this) (NN wine))
(NP (DT these) (NNS snails))
There were 3 matches in total.
-- Haiyang AI, Ph.D. student Department of Applied Linguistics The Pennsylvania State University
parser-user mailing list parser-user at lists.stanford.edu https://mailman.stanford.edu/mailman/listinfo/parser-user
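Going by the reply quoted above, a minimal sketch of feeding trees from a Python variable into Tregex might look like the following. It assumes Python 3.7+, that `-filter` behaves as described in that reply (trees are read from stdin), and that the `-s` flag from my command line above can be combined with it. The function names and the line filter are hypothetical helpers of mine, and I have not verified this against every Tregex release.

```python
import subprocess

TREGEX_CLASS = "edu.stanford.nlp.trees.tregex.TregexPattern"


def build_tregex_command(pattern):
    """Build the java argument list; -filter comes before the pattern."""
    # The classpath "stanford-tregex.jar:" is relative to the Tregex
    # directory, so the process must be started with cwd set to it.
    return ["java", "-mx900m", "-cp", "stanford-tregex.jar:",
            TREGEX_CLASS, "-filter", "-s", pattern]


def match_lines(stdout_text):
    """Keep only output lines that look like bracketed trees."""
    # Assumption: with -s each match is printed on a single line.
    return [line.strip() for line in stdout_text.splitlines()
            if line.strip().startswith("(")]


def run_tregex(trees, pattern, tregex_dir):
    """Pipe `trees` (a str of bracketed trees) to Tregex over stdin."""
    result = subprocess.run(build_tregex_command(pattern), input=trees,
                            capture_output=True, text=True,
                            cwd=tregex_dir, check=True)
    return match_lines(result.stdout)
```

A call would then look something like run_tregex(output1['sentences'][0]['parse'], 'NP[$VP]>S', '/Users/AS/stanford-tregex-2016-10-31'), using only the first alternative of my pattern, since the parse strings held in Python contain real newlines rather than a literal "\n".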
Why not use the Stanford CoreNLP server?
1.) Start up the server!
2.) Make a python request!
3.) Here are the results!
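A minimal sketch of these three steps, under some assumptions: the server was started from the CoreNLP directory with something like `java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000`, its `/tregex` endpoint takes the pattern as a URL parameter and the raw text as the POST body, and the JSON reply has one object per sentence whose keys "0", "1", ... each hold a "match" string. The function names are hypothetical, and the response layout should be checked against the server version in use.

```python
import json
import urllib.parse
import urllib.request


def tregex_url(pattern, base="http://localhost:9000"):
    """Build the /tregex URL with the pattern as a query parameter."""
    return base + "/tregex?" + urllib.parse.urlencode({"pattern": pattern})


def matches_from_response(payload):
    """Collect the matched subtrees from the decoded JSON reply."""
    found = []
    for sentence in payload.get("sentences", []):
        # Assumed layout: match indices are the string keys "0", "1", ...
        for key in sorted(sentence, key=int):
            found.append(sentence[key]["match"].strip())
    return found


def query_tregex(text, pattern, base="http://localhost:9000"):
    """POST raw text to a running CoreNLP server and return the matches."""
    request = urllib.request.Request(tregex_url(pattern, base),
                                     data=text.encode("utf-8"))
    with urllib.request.urlopen(request) as response:
        return matches_from_response(json.load(response))
```

With the server running, query_tregex(text, 'NP[$VP]>S') would be expected to return one bracketed NP string per matching sentence. The server parses the raw text itself, so no separate parsing step or tree file is needed.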