Mallet topic modelling

Posted 2019-04-28 04:48

I have been using Mallet to infer topics for a text file containing 100,000 lines (around 34 MB in Mallet format). Now I need to run it on a file containing a million lines (around 180 MB), and I am getting a java.lang.OutOfMemoryError. Is there a way to split the file into smaller ones and build a single model over the data in all the files combined? Thanks in advance.

5 Answers
beautiful°
#2 · 2019-04-28 05:06

A java.lang.OutOfMemoryError occurs mainly because of insufficient heap space. You can use the -Xms and -Xmx JVM options to set the initial and maximum heap size so that it does not happen again.
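
For example (a generic sketch, with a placeholder jar name), a run with a 512 MB initial and 2 GB maximum heap would be launched like this:

java -Xms512m -Xmx2g -jar my-application.jar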

迷人小祖宗
#3 · 2019-04-28 05:13

I'm not sure how well Mallet scales to big data, but the project at http://dragon.ischool.drexel.edu/ can store its data in disk-backed persistence and can therefore scale to unlimited corpus sizes (with lower performance, of course).

Deceive 欺骗
#4 · 2019-04-28 05:17

Given the memory size of current PCs, it should be easy to use a heap as large as 2 GB. You should try the single-machine solution before considering a cluster.
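
As a rough sketch (the install path, classpath layout, and main class here are assumptions for a typical Mallet 2.x download; on Windows use ';' as the classpath separator), a 2 GB heap can be requested directly on the java command line, bypassing the launcher scripts:

java -Xmx2g -cp "mallet-2.0.8/class:mallet-2.0.8/lib/mallet-deps.jar" cc.mallet.topics.tui.Vectors2Topics --input training.mallet --num-topics 100 --output-state topic-state.gz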

老娘就宠你
#5 · 2019-04-28 05:21

The model is still going to be huge, even if it is read from multiple files. Have you tried increasing the heap size of your Java VM?

何必那么认真
#6 · 2019-04-28 05:24

In bin/mallet.bat, increase the value on this line:

set MALLET_MEMORY=1G
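
For example, set MALLET_MEMORY=2G gives the JVM a 2 GB heap, assuming your machine has that much free RAM. On Linux/macOS the corresponding setting is in the bin/mallet script (typically a line like MEMORY=1g); the exact variable name can vary by Mallet version, so check your copy of the script.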