I need to extract sentences, tokens, POS tags, and lemmas from English and German texts in a large corpus, so I am using the Stanford CoreNLP toolkit. Its output is exactly what I need, but the problem is the running time: the English pipeline executes quickly, while the German pipeline takes far longer to annotate texts of comparable size. I initialize the two pipelines with this code:
import java.util.Properties;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

// Initialize the English pipeline
propsEN = new Properties();
propsEN.setProperty("annotators", "tokenize, ssplit, pos, lemma");
propsEN.setProperty("tokenize.language", "en");
corenlpEN = new StanfordCoreNLP(propsEN);

// Initialize the German pipeline
propsDE = new Properties();
propsDE.setProperty("annotators", "tokenize, ssplit, pos, lemma");
propsDE.setProperty("tokenize.language", "de");
corenlpDE = new StanfordCoreNLP(propsDE);
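For reference, I apply each pipeline to the raw text in the usual CoreNLP way; the snippet below is a minimal sketch rather than my exact code (the text variable is a placeholder, and in reality I loop over all documents of the corpus):

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.util.CoreMap;

// Annotate one German text and read back sentences, tokens, POS tags and lemmas
Annotation document = new Annotation(text);
corenlpDE.annotate(document);
for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
    for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
        String word  = token.get(CoreAnnotations.TextAnnotation.class);
        String pos   = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);
        String lemma = token.get(CoreAnnotations.LemmaAnnotation.class);
        // ... store word, pos and lemma ...
    }
}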
To illustrate the difference in execution times, I recorded the length of each text (in characters) and the time each pipeline needs to annotate it, measured with System.currentTimeMillis():
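The measurement is done roughly like this (again a sketch; corenlpDE, text and document are as in the snippet above):

// Measure the wall-clock time of a single annotate() call in milliseconds
long start = System.currentTimeMillis();
corenlpDE.annotate(document);
long elapsed = System.currentTimeMillis() - start;
System.out.println("German text length=" + text.length() + " --- Elapsed time=" + elapsed);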
Running the Stanford CoreNLP pipeline on the English texts (lengths in characters, times in milliseconds):
English text length=1587 --- Elapsed time=57
English text length=15906 --- Elapsed time=160
English text length=44286 --- Elapsed time=3287
English text length=19814 --- Elapsed time=1809
English text length=1427 --- Elapsed time=166
English text length=56787 --- Elapsed time=2374
Running the Stanford CoreNLP pipeline on the German texts (lengths in characters, times in milliseconds):
German text length=979 --- Elapsed time=401
German text length=22039 --- Elapsed time=15285
German text length=30632 --- Elapsed time=21659
German text length=42019 --- Elapsed time=21767
German text length=2944 --- Elapsed time=2005
German text length=76248 --- Elapsed time=48857
Why does the German pipeline take several times longer than the English one? Have I made a mistake somewhere? Is there a way to speed it up?
Any information about this topic is appreciated.