Sentence parsing is running extremely slowly

Posted 2019-09-06 14:41

Question:

I'm trying to build a sentence parser that reads in a document and predicts where to split it into sentences, without breaking on periods that don't end a sentence, such as those in "Dr." or ".NET". I've been using CoreNLP for this.

After realizing that the PCFG parser was running far too slowly (and essentially bottlenecking my entire job), I tried switching to shift-reduce parsing, which according to the CoreNLP website is much faster.

However, the SRParser is running extremely slowly and I have no idea why: the PCFG parser processes roughly 1,000 sentences per second, while the SRParser manages only about 100.

Here is the code for both. One thing that may be noteworthy is that each "document" contains only about 10-20 sentences, so they're very small:

PCFG parser:

import java.util.Properties
import scala.collection.JavaConverters._
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}

class StanfordPCFGParser {
  val props = new Properties()
  props.put("annotators", "tokenize, ssplit, pos, lemma")
  val pipeline = new StanfordCoreNLP(props)
  var i = 0 // running count of sentences seen so far
  val time = java.lang.System.currentTimeMillis()

  def parseSentence(doc: String): List[String] = {
    val tokens = new Annotation(doc)
    pipeline.annotate(tokens)
    val sentences = tokens.get(classOf[SentencesAnnotation]).asScala.toList
    // Report progress every 1000 sentences.
    sentences.foreach { _ =>
      if (i % 1000 == 0) println("parsed " + i + "in " + (java.lang.System.currentTimeMillis() - time) / 1000 + " seconds")
      i = i + 1
    }
    sentences.map(_.toString)
  }
}

Shift-Reduce Parser:

class StanfordShiftReduceParser {
  val p = new Properties()
  p.put("annotators", "tokenize, ssplit, pos, parse, lemma")
  p.put("parse.model", "englishSR.ser.gz")
  val corenlp = new StanfordCoreNLP(p)
  var i = 0 // running count of sentences seen so far
  val time = java.lang.System.currentTimeMillis()

  def parseSentences(text: String): List[String] = {
    val annotation = new Annotation(text)
    corenlp.annotate(annotation)
    val sentences = annotation.get(classOf[SentencesAnnotation]).asScala.toList
    // Report progress every 1000 sentences.
    sentences.foreach { _ =>
      if (i % 1000 == 0) println("parsed " + i + "in " + (java.lang.System.currentTimeMillis() - time) / 1000 + " seconds")
      i = i + 1
    }
    sentences.map(_.toString)
  }
}

Here is the code I used for the timing:

// getTime is a helper not shown here; it presumably returns the current time in milliseconds.
val originalParser = new StanfordPCFGParser
println("starting PCFG")
var time = getTime
sentences.foreach(originalParser.parseSentence)
time = getTime - time
println("PCFG parser took " + time.toDouble / 1000 + " seconds for 1000 documents to " + originalParser.i + " sentences")

val srParser = new StanfordShiftReduceParser
println("starting SRParse")
time = getTime
sentences.foreach(srParser.parseSentences)
time = getTime - time
println("SR parser took " + time.toDouble / 1000 + " seconds for 1000 documents to " + srParser.i + " sentences")

Which gives me the following output (I've filtered out the "Untokenizable" warnings, which occur because of questionable data sources):

Adding annotator tokenize
TokenizerAnnotator: No tokenizer type provided. Defaulting to PTBTokenizer.
Adding annotator ssplit
Adding annotator pos
Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... starting PCFG
done [0.6 sec].
Adding annotator lemma
parsed 0in 0 seconds
parsed 1000in 1 seconds
parsed 2000in 2 seconds
parsed 3000in 3 seconds
parsed 4000in 5 seconds
parsed 5000in 5 seconds
parsed 6000in 6 seconds
parsed 7000in 7 seconds
parsed 8000in 8 seconds
parsed 9000in 9 seconds
PCFG parser took 10.158 seconds for 1000 documents to 9558 sentences
Adding annotator tokenize
Adding annotator ssplit
Adding annotator pos
Adding annotator parse
Loading parser from serialized file englishSR.ser.gz ... done [8.3 sec].
starting SRParse
Adding annotator lemma
parsed 0in 0 seconds
parsed 1000in 17 seconds
parsed 2000in 30 seconds
parsed 3000in 43 seconds
parsed 4000in 56 seconds
parsed 5000in 66 seconds
parsed 6000in 77 seconds
parsed 7000in 90 seconds
parsed 8000in 101 seconds
parsed 9000in 113 seconds
SR parser took 120.506 seconds for 1000 documents to 9558 sentences

Any help would be greatly appreciated!

Answer 1:

If all you need to do is split a block of text into sentences, you only need the tokenize and ssplit annotators. The parser is completely superfluous. So:

props.put("annotators", "tokenize, ssplit")
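
For context, here is a minimal sketch of what a splitter-only pipeline might look like in Scala. The object name SentenceSplitSketch and the method splitSentences are hypothetical, chosen for illustration. The sketch assumes CoreNLP is on the classpath and that scala.collection.JavaConverters is available (on Scala 2.13 and later, scala.jdk.CollectionConverters is the preferred replacement).

```scala
import java.util.Properties

import scala.collection.JavaConverters._

import edu.stanford.nlp.ling.CoreAnnotations.{SentencesAnnotation, TextAnnotation}
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}

// Hypothetical wrapper object; the name is illustrative, not from the question.
object SentenceSplitSketch {
  private val props = new Properties()
  // Only tokenization and sentence splitting: no pos, lemma, or parse annotators.
  props.setProperty("annotators", "tokenize, ssplit")

  // Build the pipeline once and reuse it for every document.
  private val pipeline = new StanfordCoreNLP(props)

  // Returns the text of each sentence that the ssplit annotator found.
  def splitSentences(text: String): List[String] = {
    val annotation = new Annotation(text)
    pipeline.annotate(annotation)
    annotation
      .get(classOf[SentencesAnnotation])
      .asScala
      .map(_.get(classOf[TextAnnotation]))
      .toList
  }
}
```

Calling SentenceSplitSketch.splitSentences(doc) on each document would then replace both parser classes from the question. Note also that the first pipeline in the question never loads a parser (its annotator list has no parse entry, and its log has no "Adding annotator parse" line), so the two timings compare a pipeline without parsing against one with it, rather than PCFG against shift-reduce.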