I have been using MALLET to infer topics for a text file containing 100,000 lines (around 34 MB in MALLET format). Now I need to run it on a file containing a million lines (around 180 MB), and I am getting a java.lang.OutOfMemoryError. Is there a way to split the file into smaller ones and build a model for the data in all the files combined? Thanks in advance.
A java.lang.OutOfMemoryError occurs mainly because of insufficient heap space. You can use the -Xms and -Xmx JVM options to set the initial and maximum heap sizes so that the error does not occur again.
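As a rough sketch, if you invoke MALLET's topic trainer directly with java, the heap options go before the classpath and main class. The classpath and class name below follow the layout of a typical MALLET 2.0 checkout; adjust the paths, the -Xmx value, and the input file name to your own installation and RAM:

```
# start the JVM with a 1 GB initial and 4 GB maximum heap,
# then run MALLET's topic training tool on an already-imported file
java -Xms1g -Xmx4g \
     -cp "mallet/class:mallet/lib/mallet-deps.jar" \
     cc.mallet.topics.tui.Vectors2Topics \
     --input training.mallet --num-topics 100 --output-state topic-state.gz
```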
I'm not sure how well MALLET scales to big data, but the project at http://dragon.ischool.drexel.edu/ can store its data in disk-backed persistence and can therefore scale to arbitrarily large corpora (with lower performance, of course).
Given the memory sizes of current PCs, it should be easy to use a heap as large as 2 GB. You should try a single-machine solution before considering a cluster.
The model is still going to be very large, even if it is read from multiple files. Have you tried increasing the heap size of your Java VM?
In bin/mallet.bat, increase the value on the memory line.
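In current MALLET distributions this is a memory variable near the top of the script (the exact variable name can differ slightly between versions); it looks roughly like this:

```
rem give MALLET's JVM a larger maximum heap (example value; size it to your RAM)
set MALLET_MEMORY=2G
```

On Linux/macOS the equivalent setting is the MEMORY variable in the bin/mallet shell script.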