I have been using MALLET to infer topics for a text file containing 100,000 lines (around 34 MB in MALLET format). Now I need to run it on a file containing a million lines (around 180 MB), and I am getting a java.lang.OutOfMemoryError. Is there a way to split the file into smaller ones and build a model for the data in all the files combined? Thanks in advance.
A java.lang.OutOfMemoryError occurs mainly because of insufficient heap space. You can use the -Xms and -Xmx JVM options to set the initial and maximum heap sizes so that the error does not occur again.
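As a rough sketch, if you invoke MALLET's topic trainer directly with java, the heap options go before the classpath and main class. The classpath and class name below follow the layout of a typical MALLET 2.0 checkout; adjust the paths, the -Xmx value, and the input file name to your own installation and RAM:

```
# start the JVM with a 1 GB initial and 4 GB maximum heap,
# then run MALLET's topic training tool on an already-imported file
java -Xms1g -Xmx4g \
     -cp "mallet/class:mallet/lib/mallet-deps.jar" \
     cc.mallet.topics.tui.Vectors2Topics \
     --input training.mallet --num-topics 100 --output-state topic-state.gz
```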
I'm not sure how well MALLET scales to big data, but the project at http://dragon.ischool.drexel.edu/ can store its data in disk-backed persistence and can therefore scale to arbitrarily large corpora (with lower performance, of course).
Given the memory sizes of current PCs, it should be easy to use a heap as large as 2 GB. You should try a single-machine solution before considering a cluster.
The model is still going to be very large, even if it is read from multiple files. Have you tried increasing the heap size of your Java VM?
In bin/mallet.bat, increase the value on the memory line.
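In current MALLET distributions this is a memory variable near the top of the script (the exact variable name can differ slightly between versions); it looks roughly like this:

```
rem give MALLET's JVM a larger maximum heap (example value; size it to your RAM)
set MALLET_MEMORY=2G
```

On Linux/macOS the equivalent setting is the MEMORY variable in the bin/mallet shell script.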