I'm new to SageMaker and am running some tests to measure the performance of NTM and LDA on AWS compared with LDA Mallet and the native Gensim LDA model.
I want to inspect the trained models on SageMaker and look at things like which words have the highest contribution for each topic. I'd also like to get a measure of model coherence.
I have been able to get the words with the highest contribution for each topic for NTM on SageMaker by downloading the output file, untarring it, and unzipping it to expose three files: params, symbol.json and meta.json.
However, when I try to do the same process for LDA, the untarred output file cannot be unzipped.
Maybe I'm missing something or should do something different for LDA compared with NTM, but I have not been able to find any documentation on this. Also, has anyone found a simple way to calculate model coherence?
Any assistance would be greatly appreciated!
Regarding coherence, there's no default implementation in SageMaker AFAIK.
You can implement your own metric like this:
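As a minimal sketch of one such metric, here is a UMass-style coherence computed from document co-occurrence counts. The function name `umass_coherence`, the smoothing parameter `eps`, and the toy corpus are illustrative assumptions, not part of any SageMaker API:

```python
import math
from itertools import combinations


def umass_coherence(topic_words, documents, eps=1.0):
    """Average UMass-style coherence over all word pairs of one topic.

    topic_words: the topic's top words, highest weight first.
    documents:   tokenized corpus, a list of lists of strings.
    eps:         smoothing constant so pairs that never co-occur stay finite.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain every word in `words`.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    scores = []
    for i, j in combinations(range(len(topic_words)), 2):
        w_hi, w_lo = topic_words[i], topic_words[j]  # w_hi ranks above w_lo
        d_hi = doc_freq(w_hi)
        if d_hi == 0:
            continue  # word never appears in the corpus, skip the pair
        scores.append(math.log((doc_freq(w_hi, w_lo) + eps) / d_hi))
    # NaN signals that no pair could be scored.
    return sum(scores) / len(scores) if scores else float("nan")


# Toy corpus (made up for illustration).
docs = [
    ["cat", "dog", "pet"],
    ["cat", "dog"],
    ["stock", "market"],
    ["stock", "market", "trade"],
]
together = umass_coherence(["cat", "dog"], docs)
apart = umass_coherence(["cat", "market"], docs)
```

Higher (less negative) values mean the topic's words tend to appear in the same documents. In this toy corpus `together` comes out higher than `apart`, because "cat" and "market" never share a document.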
and get the coherence for your real model like:
Some intuitive tests for coherence as follows:
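A sketch of such sanity checks, using a compact co-occurrence-based coherence (UMass-style, repeated here so the snippet runs on its own). The function name and the toy corpus are assumptions made up for illustration. The idea is that a topic whose words occur in the same documents should score higher than one that mixes unrelated words:

```python
import math
from itertools import combinations


def coherence(topic_words, documents):
    """Mean log((D(a, b) + 1) / D(a)) over ordered word pairs of a topic."""
    doc_sets = [set(d) for d in documents]

    def df(*words):
        # Number of documents that contain all of `words`.
        return sum(all(w in d for w in words) for d in doc_sets)

    # Assumes every topic word occurs somewhere in the corpus.
    pairs = [(a, b) for a, b in combinations(topic_words, 2) if df(a) > 0]
    return sum(math.log((df(a, b) + 1) / df(a)) for a, b in pairs) / len(pairs)


corpus = [
    ["cat", "dog", "pet", "vet"],
    ["dog", "pet", "leash"],
    ["cat", "pet", "vet"],
    ["stock", "market", "trade"],
    ["market", "trade", "price"],
    ["stock", "price", "trade"],
]

related = coherence(["pet", "cat", "dog"], corpus)         # one theme
mixed = coherence(["pet", "stock", "dog"], corpus)         # one intruder word
unrelated = coherence(["cat", "market", "leash"], corpus)  # never co-occur

# The more thematically consistent the topic, the higher the score.
assert related > mixed > unrelated
```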
Further reading:
This SageMaker notebook, which dives into the scientific details of LDA, also demonstrates how to inspect the model artifacts. Specifically, how to obtain the estimates for the Dirichlet prior `alpha` and the topic-word distribution matrix `beta`. You can find the instructions in the section titled "Inspecting the Trained Model". For convenience, I will reproduce the relevant code here:

That should get you the model data. Note that the topics, which are stored as rows of `beta`, are not presented in any particular order.
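A minimal sketch of that inspection step, assuming the artifact layout the notebook relies on: `model.tar.gz` contains a single file whose name starts with `model_`, stored in MXNet NDArray format (not a zip archive) and holding `alpha` and `beta`. The helper names, the default paths, and the `vocab` list are hypothetical. In particular, this sketch assumes you supply the vocabulary from your own preprocessing rather than reading it from the artifact:

```python
import os
import tarfile


def load_lda_artifacts(tarball_path="model.tar.gz", extract_dir="."):
    """Extract the LDA tarball and return (alpha, beta) as NumPy arrays."""
    import mxnet as mx  # lazy import: only needed when actually loading

    with tarfile.open(tarball_path) as tar:
        tar.extractall(extract_dir)
    # Assumption: exactly one extracted file is named "model_...".
    model_file = next(
        f for f in os.listdir(extract_dir) if f.startswith("model_")
    )
    alpha, beta = mx.ndarray.load(os.path.join(extract_dir, model_file))
    return alpha.asnumpy(), beta.asnumpy()


def top_words_per_topic(beta, vocab, k=10):
    """Return the k highest-weight words for each topic (row) of beta."""
    topics = []
    for row in beta:
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        topics.append([vocab[i] for i in ranked[:k]])
    return topics


# Usage with a made-up 2-topic, 3-word matrix standing in for the real beta.
example_beta = [[0.1, 0.7, 0.2], [0.5, 0.2, 0.3]]
example_vocab = ["apple", "banana", "cherry"]
example_topics = top_words_per_topic(example_beta, example_vocab, k=2)
```

With a real model you would call `alpha, beta = load_lda_artifacts()` after downloading the tarball from S3, then pass `beta` and your own vocabulary to `top_words_per_topic`.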