I'm using StanfordNLP to classify some text. It works fine when I use training files with a maximum of 160K lines, but with larger files I get a java.lang.OutOfMemoryError: Java heap space.
I'm using the following properties:
e.s.n.c.ColumnDataClassifier - Setting ColumnDataClassifier properties
e.s.n.c.ColumnDataClassifier - 1.useAllSplitWordTriples = true
e.s.n.c.ColumnDataClassifier - useQN = true
e.s.n.c.ColumnDataClassifier - encoding = utf-8
e.s.n.c.ColumnDataClassifier - useClassFeature = true
e.s.n.c.ColumnDataClassifier - 1.binnedLengths = 10,20,30
e.s.n.c.ColumnDataClassifier - 1.minNGramLeng = 2
e.s.n.c.ColumnDataClassifier - lowercase = true
e.s.n.c.ColumnDataClassifier - intern = true
e.s.n.c.ColumnDataClassifier - 1.splitWordsRegexp = \s+
e.s.n.c.ColumnDataClassifier - goldAnswerColumn = 0
e.s.n.c.ColumnDataClassifier - 1.minWordNGramLeng = 2
e.s.n.c.ColumnDataClassifier - displayedColumn = 1
e.s.n.c.ColumnDataClassifier - printClassifierParam = 200
e.s.n.c.ColumnDataClassifier - 1.useNGrams = true
e.s.n.c.ColumnDataClassifier - QNsize = 5
e.s.n.c.ColumnDataClassifier - sigma = 3
e.s.n.c.ColumnDataClassifier - 1.useAllSplitWordPairs = true
e.s.n.c.ColumnDataClassifier - tolerance = 1e-4
e.s.n.c.ColumnDataClassifier - 1.usePrefixSuffixNGrams = true
e.s.n.c.ColumnDataClassifier - 1.useSplitWordNGrams = true
e.s.n.c.ColumnDataClassifier - 1.maxWordNGramLeng = 4
e.s.n.c.ColumnDataClassifier - 1.maxNGramLeng = 4
The train file details:
e.s.n.c.Dataset - numDatums: 231049
numDatumsPerLabel: {84146000=1654.0, 84610000=76.0, 85164000=1991.0, 85171232=25.0, 94010000=4534.0, 85171231=32257.0, 85166000=224.0, 94031000=51.0, 84181000=5607.0, 85094050=456.0, 94035000=2530.0, 84184000=586.0, 84183000=466.0, 85094020=1502.0, 85161000=375.0, 85270000=2.0, 84151000=823.0, 85163100=1977.0, 85163200=1858.0, 84430000=1803.0, 85167920=597.0, 73211100=4963.0, 84145000=3369.0, 85171100=297.0, 84500000=1919.0, 85165000=1136.0, 99999999=123959.0, 94032000=184.0, 94030000=44.0, 85091000=1466.0, 85098000=85.0, 94034000=837.0, 94036000=2066.0, 85094010=2826.0, 85287200=10090.0, 84243010=945.0, 84186900=427.0, 85183000=1130.0, 84713010=11690.0, 84715010=1633.0, 94041000=1783.0, 85167910=806.0}
numLabels: 42 [99999999, 73211100, 84145000, 84146000, 84151000, 84181000, 84183000, 84184000, 84186900, 84243010, 84430000, 84500000, 84610000, 84713010, 84715010, 85091000, 85094010, 85094020, 85094050, 85098000, 85161000, 85163100, 85163200, 85164000, 85165000, 85166000, 85167910, 85167920, 85171100, 85171231, 85171232, 85183000, 85270000, 85287200, 94010000, 94030000, 94031000, 94032000, 94034000, 94035000, 94036000, 94041000]
numFeatures (Phi(X) types): 9434620 [CLASS, 1-SW#-fulano-firmar, 1-#-oli, 1-#-irma, 1-#B-rob, ...]
And the exception is:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at edu.stanford.nlp.optimization.QNMinimizer.minimize(QNMinimizer.java:891)
at edu.stanford.nlp.optimization.QNMinimizer.minimize(QNMinimizer.java:856)
at edu.stanford.nlp.optimization.QNMinimizer.minimize(QNMinimizer.java:850)
at edu.stanford.nlp.optimization.QNMinimizer.minimize(QNMinimizer.java:93)
at edu.stanford.nlp.classify.LinearClassifierFactory.trainWeights(LinearClassifierFactory.java:529)
at edu.stanford.nlp.classify.LinearClassifierFactory.trainClassifier(LinearClassifierFactory.java:929)
at edu.stanford.nlp.classify.LinearClassifierFactory.trainClassifier(LinearClassifierFactory.java:913)
at edu.stanford.nlp.classify.ColumnDataClassifier.makeClassifier(ColumnDataClassifier.java:1482)
at edu.stanford.nlp.classify.ColumnDataClassifier.trainClassifier(ColumnDataClassifier.java:2087)
at com.firmar.TextClassifier.<init>(TextClassifier.java:75)
at com.firmar.App.main(App.java:27)
Line 75 of TextClassifier is the call cdc.trainClassifier(trainFile), where I train the ColumnDataClassifier as follows:

import edu.stanford.nlp.classify.ColumnDataClassifier;

ColumnDataClassifier cdc = new ColumnDataClassifier(propFile);
cdc.trainClassifier(trainFile);
App is just a command-line program that I wrote to run the text classifier. I invoke it as follows:
java -Xmx10240m -jar textclassifier-1.0-jar-with-dependencies.jar ./stanford_classifier.prop ./stanford_classifier.train
So, as you can see, I'm giving the app 10 GB of RAM (my server has 12 GB).
Since the exception is thrown in QNMinimizer, I tried reducing QNsize to 5 (the default is 15), but the same error occurs. Is there any parameter I can change to reduce memory usage, or will I need to add more memory to the server?
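A back-of-the-envelope estimate (my own calculation, assuming double-precision weights and that QNMinimizer, like standard L-BFGS, keeps roughly 2 × QNsize history vectors the same length as the weight vector) suggests the heap may simply be too small for this feature count: one dense weight vector is numFeatures × numLabels doubles, which is already over 3 GB here.

```java
public class HeapEstimate {
    // Rough size in bytes of one dense weight vector:
    // numFeatures * numLabels doubles, 8 bytes each.
    static long weightBytes(long numFeatures, long numLabels) {
        return numFeatures * numLabels * 8L;
    }

    public static void main(String[] args) {
        // Values taken from the Dataset log above: 9,434,620 features, 42 labels.
        long base = weightBytes(9_434_620L, 42L);
        System.out.printf("one weight vector: %.2f GB%n", base / 1e9);

        // If the optimizer keeps ~2 * QNsize history vectors plus the current
        // weights and gradient, even QNsize = 5 implies roughly 12 copies.
        long workingSet = (2L * 5 + 2) * base;
        System.out.printf("rough QN working set: %.1f GB%n", workingSet / 1e9);
    }
}
```

If this estimate is in the right ballpark, a 10-14 GB heap cannot hold the optimizer's working set, which would explain why lowering QNsize alone doesn't help.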
UPDATE: I added more memory (the server now has 16 GB, and the app runs with 14 GB) and also disabled QN (useQN=false). The same error occurs...