Spark MLlib LDA, how to infer the topics distribut

2020-01-25 07:32发布

i am interested in applying LDA topic modelling using Spark MLlib. I have checked the code and the explanations in here but I couldn't find how to use the model then to find the topic distribution in a new unseen document.

标签： apache-spark lda apache-spark-mllib topic-modeling

1条回答

霸刀☆藐视天下

2楼-- · 2020-01-25 08:01

As of Spark 1.5 this functionality has not been implemented for the DistributedLDAModel. What you're going to need to do is convert your model to a LocalLDAModel using the toLocal method and then call the topicDistributions(documents: RDD[(Long, Vector]) method where documents are the new (i.e. out-of-training) documents, something like this:

newDocuments: RDD[(Long, Vector)] = ...
val topicDistributions = distLDA.toLocal.topicDistributions(newDocuments)

This is going to be less accurate than the EM algorithm that this paper suggests, but it will work. Alternatively, you could just use the new online variational EM training algorithm which already results in a LocalLDAModel. In addition to being faster, this new algorithm is also preferable due to the fact that it, unlike the older EM algorithm for fitting DistributedLDAModels, is optimizing the parameters (alphas) of the Dirichlet prior over the topic mixing weights for the documents. According to Wallach, et. al., optimization of the alphas is pretty important for obtaining good topics.

0人赞添加讨论(0) 举报

Spark MLlib LDA, how to infer the topics distribut

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间