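I am using StanfordNLP for some text classification. When I use a training file of up to 160K lines, it works fine. But if I use a larger one, I get java.lang.OutOfMemoryError: Java heap space.
I am using the following properties: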
e.s.n.c.ColumnDataClassifier - Setting ColumnDataClassifier properties
e.s.n.c.ColumnDataClassifier - 1.useAllSplitWordTriples = true
e.s.n.c.ColumnDataClassifier - useQN = true
e.s.n.c.ColumnDataClassifier - encoding = utf-8
e.s.n.c.ColumnDataClassifier - useClassFeature = true
e.s.n.c.ColumnDataClassifier - 1.binnedLengths = 10,20,30
e.s.n.c.ColumnDataClassifier - 1.minNGramLeng = 2
e.s.n.c.ColumnDataClassifier - lowercase = true
e.s.n.c.ColumnDataClassifier - intern = true
e.s.n.c.ColumnDataClassifier - 1.splitWordsRegexp = \s+
e.s.n.c.ColumnDataClassifier - goldAnswerColumn = 0
e.s.n.c.ColumnDataClassifier - 1.minWordNGramLeng = 2
e.s.n.c.ColumnDataClassifier - displayedColumn = 1
e.s.n.c.ColumnDataClassifier - printClassifierParam = 200
e.s.n.c.ColumnDataClassifier - 1.useNGrams = true
e.s.n.c.ColumnDataClassifier - QNsize = 5
e.s.n.c.ColumnDataClassifier - sigma = 3
e.s.n.c.ColumnDataClassifier - 1.useAllSplitWordPairs = true
e.s.n.c.ColumnDataClassifier - tolerance = 1e-4
e.s.n.c.ColumnDataClassifier - 1.usePrefixSuffixNGrams = true
e.s.n.c.ColumnDataClassifier - 1.useSplitWordNGrams = true
e.s.n.c.ColumnDataClassifier - 1.maxWordNGramLeng = 4
e.s.n.c.ColumnDataClassifier - 1.maxNGramLeng = 4
Details of the training file:
e.s.n.c.Dataset - numDatums: 231049
numDatumsPerLabel: {84146000=1654.0, 84610000=76.0, 85164000=1991.0, 85171232=25.0, 94010000=4534.0, 85171231=32257.0, 85166000=224.0, 94031000=51.0, 84181000=5607.0, 85094050=456.0, 94035000=2530.0, 84184000=586.0, 84183000=466.0, 85094020=1502.0, 85161000=375.0, 85270000=2.0, 84151000=823.0, 85163100=1977.0, 85163200=1858.0, 84430000=1803.0, 85167920=597.0, 73211100=4963.0, 84145000=3369.0, 85171100=297.0, 84500000=1919.0, 85165000=1136.0, 99999999=123959.0, 94032000=184.0, 94030000=44.0, 85091000=1466.0, 85098000=85.0, 94034000=837.0, 94036000=2066.0, 85094010=2826.0, 85287200=10090.0, 84243010=945.0, 84186900=427.0, 85183000=1130.0, 84713010=11690.0, 84715010=1633.0, 94041000=1783.0, 85167910=806.0}
numLabels: 42 [99999999, 73211100, 84145000, 84146000, 84151000, 84181000, 84183000, 84184000, 84186900, 84243010, 84430000, 84500000, 84610000, 84713010, 84715010, 85091000, 85094010, 85094020, 85094050, 85098000, 85161000, 85163100, 85163200, 85164000, 85165000, 85166000, 85167910, 85167920, 85171100, 85171231, 85171232, 85183000, 85270000, 85287200, 94010000, 94030000, 94031000, 94032000, 94034000, 94035000, 94036000, 94041000]
numFeatures (Phi(X) types): 9434620 [CLASS, 1-SW#-fulano-firmar, 1-#-oli, 1-#-irma, 1-#B-rob, ...]
And the exception is:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at edu.stanford.nlp.optimization.QNMinimizer.minimize(QNMinimizer.java:891)
at edu.stanford.nlp.optimization.QNMinimizer.minimize(QNMinimizer.java:856)
at edu.stanford.nlp.optimization.QNMinimizer.minimize(QNMinimizer.java:850)
at edu.stanford.nlp.optimization.QNMinimizer.minimize(QNMinimizer.java:93)
at edu.stanford.nlp.classify.LinearClassifierFactory.trainWeights(LinearClassifierFactory.java:529)
at edu.stanford.nlp.classify.LinearClassifierFactory.trainClassifier(LinearClassifierFactory.java:929)
at edu.stanford.nlp.classify.LinearClassifierFactory.trainClassifier(LinearClassifierFactory.java:913)
at edu.stanford.nlp.classify.ColumnDataClassifier.makeClassifier(ColumnDataClassifier.java:1482)
at edu.stanford.nlp.classify.ColumnDataClassifier.trainClassifier(ColumnDataClassifier.java:2087)
at com.firmar.TextClassifier.<init>(TextClassifier.java:75)
at com.firmar.App.main(App.java:27)
Line 75 of TextClassifier is the cdc.trainClassifier(trainFile) call in my code, where I train the ColumnDataClassifier as follows:
ColumnDataClassifier cdc = new ColumnDataClassifier(propFile);
cdc.trainClassifier(trainFile);
The application is just a command-line program I wrote to perform text classification. I invoke it as follows:
java -Xmx10240m -jar textclassifier-1.0-jar-with-dependencies.jar ./stanford_classifier.prop ./stanford_classifier.train
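So, as you can see, I am giving the application 10GB to run (my server has 12GB).
Since the exception is thrown by QNMinimizer, I tried reducing QNsize to 5 (the default is 15), but the same error occurs. Is there any parameter I can change to reduce memory usage, or do I need to add more memory to the server?
Update: I added more memory (the server now has 16GB, and the application runs with 14GB), and I also disabled QN (useQN = false). Same error...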
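To put rough numbers on the problem, here is a back-of-the-envelope sketch. It assumes the linear classifier keeps one double weight per (feature, label) pair, and that the quasi-Newton minimizer holds about 2 × QNsize history vectors of that same length plus a few working vectors. Both are my assumptions about the internals, not something the logs confirm; the class name, method names, and the workingVectors count are hypothetical. Only numFeatures, numLabels, and QNsize come from the output above.

```java
public class WeightMemoryEstimate {

    // Assumed model layout: one weight per (feature, label) pair.
    public static long weightCount(long numFeatures, long numLabels) {
        return numFeatures * numLabels;
    }

    // Raw payload of that many doubles (ignores JVM object headers).
    public static long bytesForDoubles(long count) {
        return count * Double.BYTES;
    }

    // Assumed vector count: two history vectors per stored QN step,
    // plus a hypothetical number of working vectors (point, gradient, direction).
    public static long vectorsHeld(int qnSize, int workingVectors) {
        return 2L * qnSize + workingVectors;
    }

    public static void main(String[] args) {
        long numFeatures = 9_434_620L; // numFeatures from the Dataset log
        long numLabels = 42L;          // numLabels from the Dataset log
        int qnSize = 5;                // QNsize from the properties
        int workingVectors = 3;        // hypothetical, for illustration only

        long weights = weightCount(numFeatures, numLabels);
        long bytesPerVector = bytesForDoubles(weights);
        long totalBytes = bytesPerVector * vectorsHeld(qnSize, workingVectors);

        double gib = 1024.0 * 1024.0 * 1024.0;
        System.out.printf("weights per vector: %d%n", weights);
        System.out.printf("GiB per vector:     %.2f%n", bytesPerVector / gib);
        System.out.printf("GiB for all vectors: %.2f%n", totalBytes / gib);
    }
}
```

If those assumptions are anywhere near right, a single weight vector already takes around 3 GiB, so a handful of them would not fit in a 10GB heap whatever QNsize is set to. That would suggest the feature count, rather than the minimizer settings, is what drives the memory use.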