Using a Java topic modeling toolkit

Posted 2019-09-06 06:39

I'm working on text classification and I want to use topic models (LDA). My corpus consists of at least 24,000 Persian news documents. Each document in the corpus is a set of (keyword, weight) pairs extracted from the news article.

I've looked at two Java toolkits: MALLET and LingPipe. I read the MALLET tutorial on importing data, and it takes plain text as input, not the format I have. Is there any way I could change that?

I also read a little about LingPipe; the example in its tutorial used arrays of integers. Is that convenient for large data?

Which implementation of LDA would be better for me? Are there any other implementations (in Java) that would suit my data?

1 answer
迷人小祖宗
#2 · 2019-09-06 06:49

From the keyword-weight file you can create an artificial text for each document, in which every keyword is repeated in proportion to its weight and the words appear in random order. Run MALLET on the generated texts to retrieve the topics.
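
One way to read this suggestion, as a rough sketch only: repeat each keyword a number of times proportional to its weight, shuffle, and join the tokens into a plain-text pseudo-document. The class name `PseudoDocument`, the method `build`, the `scale` factor, and the sample keywords and weights are all made up for illustration; nothing here is part of MALLET or LingPipe. Since LDA treats a document as a bag of words, the shuffle does not affect the model and only mirrors the "random order" wording.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical helper, not part of any toolkit.
final class PseudoDocument {

    /**
     * Repeats each keyword round(weight * scale) times (at least once),
     * shuffles the resulting tokens, and joins them with single spaces.
     * 'scale' is an assumed tuning knob: larger values preserve the
     * weight ratios more precisely at the cost of longer texts.
     */
    static String build(Map<String, Double> weights, double scale, Random rng) {
        List<String> tokens = new ArrayList<>();
        for (Map.Entry<String, Double> e : weights.entrySet()) {
            int count = (int) Math.max(1, Math.round(e.getValue() * scale));
            for (int i = 0; i < count; i++) {
                tokens.add(e.getKey());
            }
        }
        Collections.shuffle(tokens, rng);
        return String.join(" ", tokens);
    }

    public static void main(String[] args) {
        // Made-up (keyword, weight) pairs standing in for one news document.
        Map<String, Double> doc = new LinkedHashMap<>();
        doc.put("اقتصاد", 0.6);
        doc.put("بازار", 0.3);
        doc.put("ارز", 0.1);

        // With scale = 10 the keywords appear 6, 3 and 1 times respectively.
        System.out.println(build(doc, 10, new Random(42)));
    }
}
```

Writing one such text per document (one file each, or one document per line) gives plain-text input of the kind the MALLET import tutorial describes. How to choose `scale` depends on the range of the weights in the keyword file, which the question does not state.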
