How to get topic vector of new documents and compa

Posted 2019-02-25 11:04

I'm trying to compare a single document's topic distribution (using LDA) with the topic distributions of other files in a previously created topic model, using MALLET.

I know that this can be done through MALLET commands in the terminal, but I'm having trouble finding a way to implement it in Java.

To give a gist of what the functionality of my program is:

The existing topic model was trained on a large corpus of texts. I want to use it to compare the topic distribution of a tweet that contains a certain hashtag with the topic distributions of the corpus files, and then pull out the corpus file most similar to the tweet.

I've read through MALLET's Java API docs, but they seem confusing and not very explanatory.

If anyone could give me a few tips, I'd appreciate it.

Tags: java lda mallet

1 answer

不美不萌又怎样

#2 · 2019-02-25 11:40

First, take a look at these:

Developer's guide
Tutorial slides after slide 97
Code examples in the source directory: src/cc/mallet/examples

Now, these examples show the basic functionality, but they don't show how to save and load the model if you need to separate training from testing. What you need is to save both the model and the instances after training (the new document must go through the same pipeline that was used for training), and load them before testing.

Save model and pipeline after training:

model.write(new File("model.dat"));
instances.save(new File("pipeline.dat"));

Load model and pipeline before testing:

ParallelTopicModel model = ParallelTopicModel.read(new File("model.dat"));
InstanceList instances = InstanceList.load(new File("pipeline.dat"));
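
Once the model and pipeline are loaded, what remains is the comparison itself. Below is a minimal sketch of that step in plain Java. It is not MALLET API: the class TopicSimilarity and its two methods are hypothetical names made up for illustration. It assumes you already have each topic distribution as a double[] of the same length (I believe model.getTopicProbabilities(i) gives this for a training document, and a TopicInferencer from model.getInferencer() gives it for a new one, as in the bundled topic model example, but check this against your MALLET version). Jensen-Shannon divergence is my choice of measure here; cosine similarity or another measure would also work.

```java
/** Hypothetical helper (not part of MALLET): compares topic distributions. */
final class TopicSimilarity {

    /** Jensen-Shannon divergence (natural log) between two distributions of equal length. */
    static double jensenShannon(double[] p, double[] q) {
        if (p.length != q.length) {
            throw new IllegalArgumentException("Distributions must have the same number of topics");
        }
        double divergence = 0.0;
        for (int i = 0; i < p.length; i++) {
            double m = 0.5 * (p[i] + q[i]); // mixture of the two distributions
            // Zero-probability terms contribute nothing (0 * log 0 is taken as 0).
            if (p[i] > 0) divergence += 0.5 * p[i] * Math.log(p[i] / m);
            if (q[i] > 0) divergence += 0.5 * q[i] * Math.log(q[i] / m);
        }
        return divergence;
    }

    /** Returns the index of the corpus row closest to the query, or -1 for an empty corpus. */
    static int mostSimilar(double[] query, double[][] corpus) {
        int best = -1;
        double bestDivergence = Double.POSITIVE_INFINITY;
        for (int d = 0; d < corpus.length; d++) {
            double divergence = jensenShannon(query, corpus[d]);
            if (divergence < bestDivergence) { // smaller divergence means more similar
                bestDivergence = divergence;
                best = d;
            }
        }
        return best;
    }
}
```

To use it, put the tweet's distribution in query and one row per corpus document in corpus; the returned index identifies the most similar file, provided you keep the rows in the same order as your instance list.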

Hope this helps.
