I'm using Mallet through Java, and I can't work out how to evaluate new documents against an existing topic model which I have trained.
My initial code to generate the model is very similar to that in the Mallet Developer's Guide for Topic Modelling, after which I simply save the model as a Java object. In a later process, I reload that Java object from file, add new instances via .addInstances(), and would then like to evaluate only these new instances against the topics found in the original training set.
This stats.SE thread provides some high-level suggestions, but I can't see how to work them into the Mallet framework.
Any help much appreciated.
Inference is actually also listed in the example link provided in the question (the last few lines).
For anyone interested in the whole code for saving/loading the trained model and then using it to infer the topic distribution of new documents, here are some snippets:
After model.estimate() has completed, you have the actual trained model, so you can serialize it using a standard Java ObjectOutputStream (since ParallelTopicModel implements Serializable).
Note, though, that when you infer, you also need to pass the new sentences (as Instance objects) through the same pipeline in order to pre-process them (tokenize etc.). Thus you also need to save the pipe list (since we're using SerialPipes, we can create an instance and then serialize it).
In order to load the model/pipeline and use them for inference, we need to deserialize them.
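The save and load steps described above can be sketched roughly as follows. This is a minimal sketch, not the original snippets: ModelStore, its save/load methods and the file names model.ser and pipes.ser are hypothetical names introduced for illustration. The helper itself uses only standard Java serialization; the Mallet calls are shown as comments because they need Mallet on the classpath, and the sampling parameters (10, 1, 5) are just example values.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

/** Hypothetical helper (not part of Mallet): plain Java serialization to and from a file. */
class ModelStore {

    /** Writes any Serializable object (e.g. a trained model or a pipe list) to a file. */
    static void save(Serializable object, File file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(object);
        }
    }

    /** Reads an object back from a file and casts it to the expected type. */
    static <T> T load(File file, Class<T> type) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return type.cast(in.readObject());
        }
    }

    // Intended use with Mallet (assumed variable names, shown as comments):
    //
    //   // in the training process, after model.estimate()
    //   ModelStore.save(model, new File("model.ser"));
    //   ModelStore.save(pipes, new File("pipes.ser"));   // pipes = new SerialPipes(pipeList)
    //
    //   // in the later process
    //   ParallelTopicModel model = ModelStore.load(new File("model.ser"), ParallelTopicModel.class);
    //   SerialPipes pipes = ModelStore.load(new File("pipes.ser"), SerialPipes.class);
    //
    //   // push the new document through the SAME pipeline, then infer its topic distribution
    //   InstanceList testing = new InstanceList(pipes);
    //   testing.addThruPipe(new Instance("text of the new document", null, "new doc", null));
    //   TopicInferencer inferencer = model.getInferencer();
    //   double[] topicProbabilities = inferencer.getSampledDistribution(testing.get(0), 10, 1, 5);
}
```

With this approach the new documents are never added to the model via .addInstances(); they only go through the saved pipe and the inferencer, so the trained topics stay untouched.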
For some reason I am not getting the exact same inference with the loaded model as with the original one - but this is a matter for another question (if anyone knows though, I'd be happy to hear)
And I've found the answer hidden in a slide-deck from Mallet's lead developer: