I'm trying to get the perplexity and log likelihood of a Spark LDA model (with Spark 2.1). The code below does not work (the methods logLikelihood and logPerplexity are not found), although I can save the model.
from pyspark.mllib.clustering import LDA
from pyspark.mllib.linalg import Vectors
# construct corpus
# run LDA
ldaModel = LDA.train(corpus, k=10, maxIterations=10)
logll = ldaModel.logLikelihood(corpus)
perplexity = ldaModel.logPerplexity(corpus)
Notice that these methods do not show up in dir(LDA).
What would be a working example?
I can call train but not fit; I get: 'LDA' object has no attribute 'fit'
That's because you are working with the old, RDD-based API (MLlib), i.e.
from pyspark.mllib.clustering import LDA # WRONG import
whose LDA class indeed does not include the fit, logLikelihood, or logPerplexity methods.
To use these methods, you should switch to the new, DataFrame-based API (ML):
from pyspark.ml.clustering import LDA # NOTE: different import
# Loads data.
dataset = (spark.read.format("libsvm")
           .load("data/mllib/sample_lda_libsvm_data.txt"))
# Trains an LDA model.
lda = LDA(k=10, maxIter=10)
model = lda.fit(dataset)
ll = model.logLikelihood(dataset)
lp = model.logPerplexity(dataset)
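As a side note on how the two numbers relate: as far as I can tell from the Spark source, logPerplexity is just the negated logLikelihood bound divided by the total number of tokens in the corpus, so lower is better. Below is a minimal pure-Python sketch of that relationship. The helper name log_perplexity, the toy term counts, and the log-likelihood value are all made up for illustration and are not part of the Spark API:

```python
def log_perplexity(log_likelihood, doc_term_counts):
    """Per-token log-perplexity bound from a corpus-level log-likelihood bound.

    Hypothetical helper (not a Spark function): it assumes the relation
    logPerplexity = -logLikelihood / total_token_count.
    """
    # Total number of tokens = sum of all term counts over all documents.
    token_count = sum(sum(counts) for counts in doc_term_counts)
    if token_count <= 0:
        raise ValueError("corpus must contain at least one token")
    return -log_likelihood / token_count


# Toy corpus: each row is one document's term-count vector (made-up data).
toy_counts = [[1.0, 2.0, 0.0], [0.0, 3.0, 4.0]]
# Made-up value standing in for model.logLikelihood(dataset).
toy_ll = -25.0

toy_lp = log_perplexity(toy_ll, toy_counts)
print(toy_lp)
```

With a real model you would simply call model.logPerplexity(dataset) as above; the sketch is only meant to show why the two values move together.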