Execution time of Stanford CoreNLP on other languages

Posted 2019-08-19 10:27

Question:

I need to extract sentences, tokens, POS tags and lemmas from the English and German text of a large corpus, so I used the Stanford CoreNLP tool. Its output is perfect. The problem is the running time: the English model executes quickly, but the German model takes a long time to annotate the text. I initialize the models with this code:

import java.util.Properties;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

// Initialize the English pipeline
Properties propsEN = new Properties();
propsEN.setProperty("annotators", "tokenize, ssplit, pos, lemma");
propsEN.setProperty("tokenize.language", "en");
StanfordCoreNLP corenlpEN = new StanfordCoreNLP(propsEN);

// Initialize the German pipeline
Properties propsDE = new Properties();
propsDE.setProperty("annotators", "tokenize, ssplit, pos, lemma");
propsDE.setProperty("tokenize.language", "de");
StanfordCoreNLP corenlpDE = new StanfordCoreNLP(propsDE);

To show the difference in execution times, I recorded the length of each text and the time each model takes to annotate it. I measured the elapsed time with System.currentTimeMillis():

Running the Stanford CoreNLP model on English text:

English text length=1587 --- Elapsed time=57 ms

English text length=15906 --- Elapsed time=160 ms

English text length=44286 --- Elapsed time=3287 ms

English text length=19814 --- Elapsed time=1809 ms

English text length=1427 --- Elapsed time=166 ms

English text length=56787 --- Elapsed time=2374 ms

Running the Stanford CoreNLP model on German text:

German text length=979 --- Elapsed time=401 ms

German text length=22039 --- Elapsed time=15285 ms

German text length=30632 --- Elapsed time=21659 ms

German text length=42019 --- Elapsed time=21767 ms

German text length=2944 --- Elapsed time=2005 ms

German text length=76248 --- Elapsed time=48857 ms
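
For anyone who wants to reproduce numbers like these, here is a minimal sketch of a timing harness. It is not the code used for the measurements above: the class AnnotationTimer and both of its methods are hypothetical names, and the lambda in main is only a stand-in for the real annotation call. It uses System.nanoTime() rather than System.currentTimeMillis() because the former is a monotonic clock, and it normalizes by text length so that texts of different sizes can be compared (assuming the reported lengths are character counts).

```java
import java.util.function.Consumer;

// Hypothetical helper, not part of CoreNLP: times one run of any text-processing step.
final class AnnotationTimer {

    // Runs `annotate` once on `text` and returns the elapsed wall-clock time in milliseconds.
    static long timeMillis(Consumer<String> annotate, String text) {
        long start = System.nanoTime(); // monotonic, unaffected by system clock adjustments
        annotate.accept(text);
        return (System.nanoTime() - start) / 1_000_000;
    }

    // Milliseconds spent per 1000 characters, so runs on texts of different length are comparable.
    static double millisPer1000Chars(long elapsedMillis, int textLength) {
        if (textLength <= 0) {
            throw new IllegalArgumentException("textLength must be positive");
        }
        return elapsedMillis * 1000.0 / textLength;
    }

    public static void main(String[] args) {
        String text = "Dies ist nur ein kurzer Platzhaltertext.";
        // Stand-in workload; in the real setup this would be the pipeline's annotate call.
        long elapsed = timeMillis(s -> s.split("\\s+"), text);
        System.out.println("length=" + text.length() + " --- elapsed=" + elapsed + " ms");

        // Normalizing two of the measurements reported above (length, elapsed ms):
        System.out.println("German:  " + millisPer1000Chars(15285, 22039) + " ms per 1000 chars");
        System.out.println("English: " + millisPer1000Chars(160, 15906) + " ms per 1000 chars");
    }
}
```

The first call on a fresh JVM usually includes class loading and JIT warm-up, so it may be worth discarding the first measurement or two before comparing languages.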

Why does the German model take several times longer? Have I made a mistake? Is there a way to solve this problem?

Any information about this topic is appreciated.

Answer 1:

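I don't know if this will help, but you're not using the German part-of-speech tagger. Setting tokenize.language alone does not change the tagger, so the pos annotator still loads its default (English) model. You can set the German one with the pos.model property.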

Here is a list of options (make sure you have the German models jar):

edu/stanford/nlp/models/pos-tagger/german/german-fast.tagger
edu/stanford/nlp/models/pos-tagger/german/german-hgc.tagger
edu/stanford/nlp/models/pos-tagger/german/german-fast-caseless.tagger
edu/stanford/nlp/models/pos-tagger/german/german-ud.tagger

Also, there is no lemmatizer for German.
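
Putting the two points above together, a German configuration might look like the sketch below. The class GermanPipelineProps and the method germanProps are hypothetical names used only for illustration. The choice of german-fast.tagger is just one of the options listed above, and lemma is left out of the annotator list because there is no German lemmatizer. The block builds only the Properties object, so it runs without CoreNLP; constructing the pipeline itself requires CoreNLP and the German models jar on the classpath, and whether this closes the speed gap is something you would have to measure.

```java
import java.util.Properties;

// Hypothetical helper: builds the properties for a German tokenize/ssplit/pos pipeline.
final class GermanPipelineProps {

    // One of the tagger paths listed above; it is resolved from the German models jar.
    static final String GERMAN_FAST_TAGGER =
            "edu/stanford/nlp/models/pos-tagger/german/german-fast.tagger";

    static Properties germanProps() {
        Properties props = new Properties();
        // "lemma" is omitted on purpose: there is no German lemmatizer.
        props.setProperty("annotators", "tokenize, ssplit, pos");
        props.setProperty("tokenize.language", "de");
        // Point the pos annotator at a German model instead of the default one.
        props.setProperty("pos.model", GERMAN_FAST_TAGGER);
        return props;
    }

    public static void main(String[] args) {
        Properties props = germanProps();
        System.out.println("pos.model=" + props.getProperty("pos.model"));
        // With CoreNLP and the German models jar on the classpath, the pipeline
        // would then be created as in the question:
        //   StanfordCoreNLP corenlpDE = new StanfordCoreNLP(props);
    }
}
```

If tagging accuracy matters more than speed, the same property can point at one of the other listed models, for example german-hgc.tagger.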