Question:

I'm new to SageMaker and am running some tests to measure the performance of NTM and LDA on AWS compared with LDA Mallet and the native Gensim LDA model.

I want to inspect the trained models on SageMaker and look at things like which words have the highest contribution for each topic. I'd also like to get a measure of model coherence.

I have been able to get the words with the highest contribution for each topic for NTM on SageMaker by downloading the output file, untarring it, and unzipping the result to expose 3 files: params, symbol.json and meta.json.

However, when I try to do the same process for LDA, the untarred output file cannot be unzipped.

Maybe I'm missing something, or should do something different for LDA compared with NTM, but I have not been able to find any documentation on this. Also, has anyone found a simple way to calculate model coherence?

Any assistance would be greatly appreciated!

Answer 1:

This SageMaker notebook, which dives into the scientific details of LDA, also demonstrates how to inspect the model artifacts. Specifically, it shows how to obtain the estimates for the Dirichlet prior alpha and the topic-word distribution matrix beta. You can find the instructions in the section titled "Inspecting the Trained Model". For convenience, I will reproduce the relevant code here:

import os
import tarfile
import mxnet as mx

# extract the tarball into the current working directory
tarfile_fname = FILENAME_PREFIX + 'model.tar.gz' # wherever the tarball is located
with tarfile.open(tarfile_fname) as tar:
    tar.extractall()

# obtain the model file (should be the only file starting with "model_")
model_list = [
    fname
    for fname in os.listdir('.')
    if fname.startswith('model_')
]
model_fname = model_list[0]

# load the contents of the model file into MXNet arrays
alpha, beta = mx.ndarray.load(model_fname)

That should get you the model data. Note that the topics, which are stored as rows of beta, are not presented in any particular order.
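To get from beta to the highest-contribution words per topic, which is what the question asks for, something like the following sketch might work. It assumes beta has one row per topic and one column per vocabulary word, and that you still have the vocabulary list from your own preprocessing in the same column order. As far as I know the model file holds only alpha and beta, not the words themselves. The helper name top_words_per_topic and the toy data are made up for illustration.

```python
import numpy as np

def top_words_per_topic(beta, vocab, k=10):
    """For each topic (row of beta), return the k highest-weight (word, weight) pairs."""
    beta = np.asarray(beta, dtype=float)
    if beta.shape[1] != len(vocab):
        raise ValueError("beta must have one column per vocabulary word")
    # argsort sorts ascending, so reverse each row and keep the first k columns
    top_idx = np.argsort(beta, axis=1)[:, ::-1][:, :k]
    return [
        [(vocab[j], float(beta[t, j])) for j in row]
        for t, row in enumerate(top_idx)
    ]

# toy data standing in for beta.asnumpy() and the real vocabulary
toy_vocab = ["cat", "dog", "stock", "bond"]
toy_beta = [[0.50, 0.40, 0.07, 0.03],
            [0.02, 0.08, 0.30, 0.60]]

for topic_id, words in enumerate(top_words_per_topic(toy_beta, toy_vocab, k=2)):
    print(topic_id, words)
```

With the real model you would pass beta.asnumpy() instead of toy_beta.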

Answer 2:

Regarding coherence, there's no default implementation in SageMaker AFAIK.

You can implement your own metric, for example the mean pairwise cosine similarity between topic vectors, like this:

from itertools import combinations
from sklearn.metrics.pairwise import cosine_similarity

def calculate_coherence(topic_vectors):
    # mean cosine similarity over all unordered pairs of topic vectors
    similarity_sum = 0.0
    num_combinations = 0
    for first, second in combinations(topic_vectors, 2):
        # cosine_similarity returns a 1x1 matrix; take its single entry
        similarity_sum += cosine_similarity([first], [second])[0, 0]
        num_combinations += 1
    return float(similarity_sum / num_combinations)

and compute it for your trained model like this:

print(calculate_coherence(beta.asnumpy()))

Some intuitive tests for the metric:

predictions = [[0.0, 0.0, 0.0],
               [0.0, 1.0, 0.0],
               [0.0, 0.0, 1.0],
               [1.0, 0.0, 0.0]]

assert calculate_coherence(predictions) == 0.0, "Expected incoherent"

predictions = [[0.0, 1.0, 1.0],
               [0.0, 1.0, 1.0],
               [0.0, 1.0, 1.0],
               [0.0, 1.0, 1.0]]

# compare with a tolerance: normalising the vectors can introduce rounding error
assert abs(calculate_coherence(predictions) - 1.0) < 1e-9, "Expected coherent"

predictions = [[0.0, 0.0, 1.0],
               [0.0, 0.0, 1.0],
               [1.0, 0.0, 0.0],
               [1.0, 0.0, 0.0],
               [0.0, 1.0, 0.0],
               [0.0, 1.0, 0.0]]
# 3 of the 15 pairs are identical, so the mean is 3/15 = 0.2 (up to rounding)
assert abs(calculate_coherence(predictions) - 0.2) < 1e-9, "Expected partially coherent"
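
With many topics, the pair-by-pair loop can get slow. The same mean pairwise cosine similarity can be computed in one matrix product with NumPy alone. This is only a sketch: the name mean_pairwise_cosine is made up, and it assumes all-zero rows should contribute a similarity of 0.

```python
import numpy as np

def mean_pairwise_cosine(topic_vectors):
    """Mean cosine similarity over all unordered pairs of rows."""
    x = np.asarray(topic_vectors, dtype=float)
    if x.ndim != 2 or x.shape[0] < 2:
        raise ValueError("need a 2-D array with at least two rows")
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0  # assumption: all-zero rows stay zero, giving similarity 0
    unit = x / norms
    sims = unit @ unit.T  # full matrix of pairwise cosine similarities
    rows, cols = np.triu_indices(x.shape[0], k=1)  # each unordered pair once
    return float(sims[rows, cols].mean())

# same data as the "partially coherent" test: 3 identical pairs out of 15
example = [[0.0, 0.0, 1.0],
           [0.0, 0.0, 1.0],
           [1.0, 0.0, 0.0],
           [1.0, 0.0, 0.0],
           [0.0, 1.0, 0.0],
           [0.0, 1.0, 0.0]]
print(mean_pairwise_cosine(example))
```

For the trained model you would call mean_pairwise_cosine(beta.asnumpy()).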

Further reading: