I'm new to SageMaker and am running some tests to compare the performance of NTM and LDA on AWS against LDA Mallet and the native Gensim LDA model.
I want to inspect the trained models on SageMaker, for example to see which words contribute most to each topic, and to get a measure of model coherence.
I have been able to get the highest-contributing words for each topic for NTM on SageMaker by downloading the output file, untarring it, and unzipping the result to expose three files: params, symbol.json and meta.json.
However, when I try the same process for LDA, the untarred output file cannot be unzipped.
Maybe I'm missing something, or LDA needs a different approach from NTM, but I have not been able to find any documentation on this. Also, has anyone found a simple way to calculate model coherence?
Any assistance would be greatly appreciated!
This SageMaker notebook, which dives into the scientific details of LDA, also demonstrates how to inspect the model artifacts, specifically how to obtain the estimates for the Dirichlet prior alpha and the topic-word distribution matrix beta. You can find the instructions in the section titled "Inspecting the Trained Model". For convenience, I will reproduce the relevant code here:
import os
import tarfile
import mxnet as mx
# extract the tarball into the directory that is searched below
# (FILENAME_PREFIX is assumed to be a directory path ending in a separator)
tarfile_fname = FILENAME_PREFIX + 'model.tar.gz' # wherever the tarball is located
with tarfile.open(tarfile_fname) as tar:
    tar.extractall(path=FILENAME_PREFIX)
# obtain the model file (should be the only file starting with "model_")
model_list = [
    fname
    for fname in os.listdir(FILENAME_PREFIX)
    if fname.startswith('model_')
]
model_fname = os.path.join(FILENAME_PREFIX, model_list[0])
# load the contents of the model file into MXNet arrays
alpha, beta = mx.ndarray.load(model_fname)
That should get you the model data. Note that the topics, which are stored as rows of beta, are not presented in any particular order.
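To get from beta to the highest-contributing words per topic, you also need the vocabulary, since the model file loaded above only yields alpha and beta. Here is a minimal sketch, assuming you kept a word list from your own preprocessing in the same column order that was used for training. The function name `top_words_per_topic`, the toy matrix, and the toy vocabulary are all made up for illustration:

```python
import numpy as np

def top_words_per_topic(beta, vocab, k=10):
    """Return the k highest-weight words for each topic (row) of beta."""
    beta = np.asarray(beta)
    # sort each row ascending, reverse to descending, keep the first k columns
    top_idx = np.argsort(beta, axis=1)[:, ::-1][:, :k]
    return [[vocab[j] for j in row] for row in top_idx]

# toy stand-ins (hypothetical data): 2 topics over a 4-word vocabulary
toy_beta = np.array([[0.10, 0.60, 0.20, 0.10],
                     [0.50, 0.05, 0.05, 0.40]])
toy_vocab = ["apple", "bank", "cloud", "data"]

top_words = top_words_per_topic(toy_beta, toy_vocab, k=2)
print(top_words)
```

With the real model you would pass `beta.asnumpy()` instead of `toy_beta`, and your own vocabulary list instead of `toy_vocab`.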
Regarding coherence, there's no default implementation in SageMaker AFAIK.
You can implement your own metric, for example the average pairwise cosine similarity between topic vectors, like this:
from itertools import combinations
from sklearn.metrics.pairwise import cosine_similarity
def calculate_coherence(topic_vectors):
    # average cosine similarity over all unordered pairs of topic vectors
    similarity_sum = 0.0
    num_combinations = 0
    for pair in combinations(topic_vectors, 2):
        # cosine_similarity returns a 1x1 matrix; take its single entry
        similarity = cosine_similarity([pair[0]], [pair[1]])[0, 0]
        similarity_sum = similarity_sum + similarity
        num_combinations = num_combinations + 1
    return float(similarity_sum / num_combinations)
and get the coherence for your real model like:
print(calculate_coherence(beta.asnumpy()))
Some intuitive tests for coherence follow (compared with a small tolerance to avoid floating-point surprises):
import math
predictions = [[0.0, 0.0, 0.0],
               [0.0, 1.0, 0.0],
               [0.0, 0.0, 1.0],
               [1.0, 0.0, 0.0]]
assert math.isclose(calculate_coherence(predictions), 0.0, abs_tol=1e-9), "Expected incoherent"
predictions = [[0.0, 1.0, 1.0],
               [0.0, 1.0, 1.0],
               [0.0, 1.0, 1.0],
               [0.0, 1.0, 1.0]]
assert math.isclose(calculate_coherence(predictions), 1.0, abs_tol=1e-9), "Expected coherent"
predictions = [[0.0, 0.0, 1.0],
               [0.0, 0.0, 1.0],
               [1.0, 0.0, 0.0],
               [1.0, 0.0, 0.0],
               [0.0, 1.0, 0.0],
               [0.0, 1.0, 0.0]]
assert math.isclose(calculate_coherence(predictions), 0.2, abs_tol=1e-9), "Expected partially coherent"
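Note that the metric above compares topics with each other. If you would rather score each topic against a reference corpus, a co-occurrence-based measure is an alternative. Below is a rough sketch of a UMass-style coherence score for a single topic, assuming you have the topic's top words (most probable first) and a tokenized corpus. The function name `umass_coherence` and the toy documents are made up for illustration, and skipping words with zero document frequency is my own simplification:

```python
import math

def umass_coherence(top_words, documents):
    """UMass-style coherence for one topic.

    top_words: topic words ordered from most to least probable.
    documents: iterable of tokenized documents (lists of words).
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # number of documents that contain all of the given words
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            d_l = doc_freq(top_words[l])
            if d_l == 0:
                continue  # simplification: skip words absent from the corpus
            d_ml = doc_freq(top_words[m], top_words[l])
            score += math.log((d_ml + 1) / d_l)
    return score

# toy corpus (hypothetical data)
toy_docs = [["bank", "loan", "money"],
            ["bank", "money"],
            ["cloud", "data"],
            ["bank", "cloud"]]

coherent = umass_coherence(["bank", "money"], toy_docs)
mixed = umass_coherence(["bank", "data"], toy_docs)
print(coherent, mixed)
```

Scores closer to zero mean the topic's words co-occur more often. To get a single number for a model, you could average the per-topic scores.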
Further reading:
- http://svn.aksw.org/papers/2015/WSDM_Topic_Evaluation/public.pdf