Force Stanford CoreNLP Parser to Prioritize 'S'

Posted 2019-08-27 23:40

Question:

Greetings NLP Experts,

I am using the Stanford CoreNLP software package to produce constituency parses, using the most recent version (3.9.2) of the English language models JAR, downloaded from the CoreNLP Download page. I access the parser via the Python interface from the NLTK module nltk.parse.corenlp. Here is a snippet from the top of my main module:

import nltk
from nltk.tree import ParentedTree
from nltk.parse.corenlp import CoreNLPParser

parser = CoreNLPParser(url='http://localhost:9000')

I also fire up the server using the following (fairly generic) call from the terminal:

java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
    -annotators "parse" -port 9000 -timeout 30000

The parser that CoreNLP selects by default (when the full English model is available) is the Shift-Reduce (SR) parser, which is sometimes claimed to be both more accurate and faster than the CoreNLP PCFG parser. Impressionistically, I can corroborate that with my own experience, where I deal almost exclusively with Wikipedia text.

However, I have noticed that the parser often erroneously parses what is in fact a complete sentence (i.e., a finite matrix clause) as a subsentential constituent instead, often an NP. In other words, the parser should output an S label directly below the root (ROOT (S ...)), but something in the complexity of the sentence's syntax pushes it to say the sentence is not a sentence (ROOT (NP ...)), etc.

The parses for such problem sentences also always contain another (usually glaring) error further down in the tree. Below are a few examples. I'll just paste in the top few levels of each tree to save space. Each is a perfectly acceptable English sentence, and so the parses should all begin (ROOT (S ...)). However, in each case some other label takes the place of S, and the rest of the tree is garbled.

NP: An estimated 22–189 million school days are missed annually due to a cold. (ROOT (NP (NP An estimated 22) (: --) (S 189 million school days are missed annually due to a cold) (. .)))

FRAG: More than one-third of people who saw a doctor received an antibiotic prescription, which has implications for antibiotic resistance. (ROOT (FRAG (NP (NP More than one-third) (PP of people who saw a doctor received an antibiotic prescription, which has implications for antibiotic resistance)) (. .)))

UCP: Coffee is a brewed drink prepared from roasted coffee beans, the seeds of berries from certain Coffea species. (ROOT (UCP (S Coffee is a brewed drink prepared from roasted coffee beans) (, ,) (NP the seeds of berries from certain Coffea species) (. .)))

At long last, here is my question, which I trust the above evidence shows to be a useful one: Given that my data contains a negligible number of fragments or otherwise ill-formed sentences, how can I impose a high-level constraint on the CoreNLP parser so that its algorithm gives priority to assigning an S node directly below ROOT?

I am curious to see whether imposing such a constraint when processing data that one knows satisfies it will also cure the myriad other ills observed in the parses produced. From what I understand, the solution would not lie in specifying a ParserAnnotations.ConstraintAnnotation. Would it?

Answer 1:

You can specify that a given token span has to be parsed with a given label, so you can require the span covering the entire sentence to be an S. I think you have to do this in Java code, though.

Here is example code that shows setting constraints.

https://github.com/stanfordnlp/CoreNLP/blob/master/itest/src/edu/stanford/nlp/parser/shiftreduce/ShiftReduceParserITest.java
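To make that concrete, below is a rough, untested sketch in Java of what setting such a constraint might look like, modeled on the constraint API exercised in that test (ParserConstraint together with ParserQuery.setConstraints). The model paths, the example sentence, and the exact label regex are assumptions on my part and may need adjusting for your setup.

import java.util.Collections;
import java.util.List;
import java.util.regex.Pattern;

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.ling.SentenceUtils;
import edu.stanford.nlp.parser.common.ParserConstraint;
import edu.stanford.nlp.parser.common.ParserQuery;
import edu.stanford.nlp.parser.shiftreduce.ShiftReduceParser;
import edu.stanford.nlp.tagger.maxent.MaxentTagger;
import edu.stanford.nlp.trees.Tree;

public class ForceSParse {
  public static void main(String[] args) {
    // Paths into the English models jar; adjust if yours differ.
    ShiftReduceParser parser = ShiftReduceParser.loadModel(
        "edu/stanford/nlp/models/srparser/englishSR.ser.gz");
    MaxentTagger tagger = new MaxentTagger(
        "edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger");

    // The shift-reduce parser expects POS-tagged input, so tag first.
    List<CoreLabel> sentence = SentenceUtils.toCoreLabelList(
        "Coffee", "is", "a", "brewed", "drink", ".");
    tagger.tagCoreLabels(sentence);

    // Require the span covering the whole sentence (tokens 0..n, end exclusive)
    // to be an S. The label is a regex, following the style used in the linked
    // test, so labels carrying extra annotation (e.g. "S-1") still match.
    ParserConstraint constraint = new ParserConstraint(
        0, sentence.size(), Pattern.compile("S|S[^a-zA-Z].*"));
    List<ParserConstraint> constraints = Collections.singletonList(constraint);

    ParserQuery query = parser.parserQuery();
    query.setConstraints(constraints);
    query.parse(sentence);

    Tree tree = query.getBestParse();
    tree.pennPrint();  // if the constraint is honored, this should print (ROOT (S ...))
  }
}

I have not verified how a constraint spanning the entire sentence interacts with the ROOT node the parser adds on top, so treat the whole-span constraint as something to experiment with. If you run parsing through the StanfordCoreNLP pipeline rather than calling the parser directly, my understanding is that the same list of constraints can be attached to each sentence under ParserAnnotations.ConstraintAnnotation before the parse annotator runs, which would connect back to the annotation mentioned in the question, but I have not tried that route myself.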