Greetings NLP Experts,
I am using the Stanford CoreNLP software package to produce constituency parses, with the most recent version (3.9.2) of the English models JAR, downloaded from the CoreNLP download page. I access the parser from Python via the NLTK module nltk.parse.corenlp. Here is a snippet from the top of my main module:
import nltk
from nltk.tree import ParentedTree
from nltk.parse.corenlp import CoreNLPParser
parser = CoreNLPParser(url='http://localhost:9000')
I also fire up the server using the following (fairly generic) call from the terminal:
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
    -annotators "parse" -port 9000 -timeout 30000
The parser that CoreNLP selects by default (when the full English models are available) is the shift-reduce (SR) parser, which is sometimes claimed to be both faster and more accurate than CoreNLP's PCFG parser. Impressionistically, I can corroborate that from my own experience, which is almost exclusively with Wikipedia text.
However, I have noticed that the parser will often erroneously parse what is in fact a complete sentence (i.e., a finite matrix clause) as a subsentential constituent instead, often an NP. In other words, the parser should output an S label at the root level, i.e. (ROOT (S ...)), but something in the complexity of the sentence's syntax pushes the parser to decide that a sentence is not a sentence, producing (ROOT (NP ...)) or the like.
The parses for such problem sentences also invariably contain another (usually glaring) error further down in the tree. Below are a few examples; to save space, I paste only the top few levels of each tree. Each is a perfectly acceptable English sentence, so every parse should begin (ROOT (S ...)). In each case, however, some other label takes the place of S, and the rest of the tree is garbled.
NP: An estimated 22–189 million school days are missed annually due to a cold.
(ROOT (NP (NP An estimated 22) (: --) (S 189 million school days are missed annually due to a cold) (. .)))
FRAG: More than one-third of people who saw a doctor received an antibiotic prescription, which has implications for antibiotic resistance.
(ROOT (FRAG (NP (NP More than one-third) (PP of people who saw a doctor received an antibiotic prescription, which has implications for antibiotic resistance)) (. .)))
UCP: Coffee is a brewed drink prepared from roasted coffee beans, the seeds of berries from certain Coffea species.
(ROOT (UCP (S Coffee is a brewed drink prepared from roasted coffee beans) (, ,) (NP the seeds of berries from certain Coffea species) (. .)))
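For what it's worth, flagging these cases in bulk only requires a quick check on the bracketed output. Here is the minimal sketch I use (the helper name root_child_label is my own; it assumes Penn-style bracketed strings like those above):

```python
import re

def root_child_label(bracketed):
    """Return the label of the node directly below ROOT in a
    Penn-style bracketed parse string, or None if no match."""
    match = re.match(r"\s*\(ROOT\s+\((\S+)", bracketed)
    return match.group(1) if match else None

# The three problem parses above are all flagged as non-S roots:
for parse in [
    "(ROOT (NP (NP An estimated 22) (: --) (S 189 million school days "
    "are missed annually due to a cold) (. .)))",
    "(ROOT (FRAG (NP (NP More than one-third) (PP of people who saw a "
    "doctor received an antibiotic prescription, which has implications "
    "for antibiotic resistance)) (. .)))",
    "(ROOT (UCP (S Coffee is a brewed drink prepared from roasted coffee "
    "beans) (, ,) (NP the seeds of berries from certain Coffea species) "
    "(. .)))",
]:
    print(root_child_label(parse))  # prints NP, FRAG, UCP
```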
At long last, here is my question, which I trust the above evidence shows is a useful one: given that my data contains a negligible number of fragments or otherwise ill-formed sentences, how can I impose a high-level constraint on the CoreNLP parser such that its algorithm gives priority to assigning an S node directly below ROOT?
I am curious to see whether imposing such a constraint when processing data (that one knows to satisfy it) will also cure the myriad other ills observed in the parses. From what I understand, the solution would not lie in specifying a ParserAnnotations.ConstraintAnnotation. Would it?
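In the meantime, the only workaround I have come up with is not a constraint at all: whenever the SR parser yields a non-S root, re-request the parse from the PCFG model by passing annotation properties to the server. This is just a sketch of what I mean; the properties mechanism and the parse.model property are from the CoreNLP server documentation, but I am assuming the PCFG model path below is the right one for the 3.9.2 models JAR:

```python
import json
import urllib.parse
import urllib.request

CORENLP_URL = "http://localhost:9000"

# Override the default (SR) model with the PCFG model for this request.
# Assumption: this path is where the English PCFG lives in the models JAR.
PCFG_PROPS = {
    "annotators": "parse",
    "parse.model": "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz",
    "outputFormat": "json",
}

def corenlp_request_url(base_url, props):
    """Build a CoreNLP server URL carrying annotation properties as a
    URL-encoded JSON blob in the query string, per the server docs."""
    return base_url + "/?properties=" + urllib.parse.quote(json.dumps(props))

def pcfg_parse(sentence):
    """Re-parse one sentence with the PCFG model.

    Requires the CoreNLP server started as shown above to be running."""
    req = urllib.request.Request(
        corenlp_request_url(CORENLP_URL, PCFG_PROPS),
        data=sentence.encode("utf-8"),
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

But this doubles the latency for every flagged sentence and still offers no guarantee of an S root, which is why I am hoping there is a genuine constraint mechanism I have missed.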