What kind of processing should be done to the input that is given to the parser?
As of now I am using the Stanford parser.jar, but there is also a Stanford coreNLP.jar. What is the difference between the parsing done by parser.jar and by coreNLP.jar?
As per the CoreNLP documentation, you can pass the operations you want to perform as the list of annotators:
COMMAND:
java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref -file input.txt
To use parsing in CoreNLP, can I pass only parse, or should I pass all the annotators except dcoref? That is, should I run
java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,parse -file input.txt
or
java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref -file input.txt
Does parser.jar have sentence splitting built into its jar?
Can I give a paragraph as input and get the sentences and their parsed data as output, or should I give only one sentence at a time?
Thank you,
The CoreNLP annotators can be thought of as a dependency graph. The parser annotator depends only on tokenization (tokenize) and sentence splitting (ssplit). So, you could run the parser with your first command:

java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,parse -file input.txt

If you know your text is pre-tokenized, the easiest thing to do is to set the option tokenize.whitespace = "true" in your properties file (or pass it in as a flag: -tokenize.whitespace). To only sentence split at the end of a line, you can set the option ssplit.eolonly.
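If it helps, here is a minimal sketch of such a properties file combining those two options (the file name parse.properties and the one-sentence-per-line layout are just assumptions for illustration):

# parse.properties -- hypothetical file name, sketch only
annotators = tokenize,ssplit,parse
# input is already tokenized; split tokens on whitespace only
tokenize.whitespace = true
# treat each line of the input file as exactly one sentence
ssplit.eolonly = true

You would then point CoreNLP at it with -props parse.properties instead of listing -annotators on the command line.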
But, by default, yes, CoreNLP will tokenize and split up your sentences for you. You can just feed in a pile of text, and it will output parsed sentences.
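If you would rather call the pipeline from Java code instead of the command line, here is a minimal sketch using the standard CoreNLP API (the example paragraph and the class name ParseParagraph are just for illustration; it assumes the CoreNLP jars and models are on the classpath):

import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.TreeCoreAnnotations;
import edu.stanford.nlp.util.CoreMap;

public class ParseParagraph {
    public static void main(String[] args) {
        // Only the annotators the parser actually depends on.
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,parse");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        // A whole paragraph: ssplit breaks it into sentences for you.
        Annotation document = new Annotation(
            "Stanford CoreNLP runs as a pipeline. You can feed it a whole paragraph.");
        pipeline.annotate(document);

        // One constituency tree per detected sentence.
        for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
            System.out.println(sentence.get(CoreAnnotations.TextAnnotation.class));
            Tree tree = sentence.get(TreeCoreAnnotations.TreeAnnotation.class);
            tree.pennPrint();
        }
    }
}

Each loop iteration prints the sentence text followed by its Penn-style parse tree, which is the paragraph-in, parsed-sentences-out behaviour described above, done programmatically.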