Inefficiency of topic modelling for text clustering

Posted 2019-03-04 09:01

Question:

I tried doing text clustering using LDA, but it isn't giving me distinct clusters. Below is my code:

#Import libraries
from gensim import corpora, models
import pandas as pd
from gensim.parsing.preprocessing import STOPWORDS
from itertools import chain

#stop words
stoplist = list(STOPWORDS)
new = ['education','certification','certificate','certified']
stoplist.extend(new)
stoplist.sort()

#read data
dat = pd.read_csv(r'D:\data_800k.csv',encoding='latin').Certi.tolist()
#remove stop words
texts = [[word for word in document.lower().split() if word not in stoplist] for document in dat]
#dictionary
dictionary = corpora.Dictionary(texts)
#corpus
corpus = [dictionary.doc2bow(text) for text in texts]
#train model
lda = models.LdaMulticore(corpus, id2word=dictionary, num_topics=25, workers=4,minimum_probability=0)
#print topics
lda.print_topics(num_topics=25, num_words=7)
#get corpus
lda_corpus = lda[corpus]
#calculate cutoff score
scores = list(chain(*[[score for topic_id,score in topic] \
                      for topic in [doc for doc in lda_corpus]]))


#threshold
threshold = sum(scores)/len(scores)
threshold
#output: 0.039999999971137644

#cluster1
cluster1 = [j for i,j in zip(lda_corpus,dat) if i[0][1] > threshold]

#cluster2
cluster2 = [j for i,j in zip(lda_corpus,dat) if i[1][1] > threshold]

The problem is that the clusters overlap: elements in cluster1 also tend to be present in cluster2, and so on.

I also tried increasing the threshold manually to 0.5, but it gives me the same issue.

Answer 1:

That is just realistic.

Neither documents nor words are usually uniquely assignable to a single cluster.

If you manually label some data, you will also quickly find documents that cannot be clearly labeled as one or the other. So it's good if the algorithm doesn't pretend there is a good unique assignment.
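
If disjoint clusters are still wanted, one common workaround (not something the answer above prescribes) is to assign each document to its single most probable topic instead of thresholding every topic. Note also that the mean of all scores is necessarily about 1/num_topics (1/25 = 0.04 here), because each document's topic probabilities sum to 1, so that cutoff lets most documents land in several clusters. Below is a minimal sketch; `hard_assign`, `build_clusters` and the toy data are hypothetical names and values for illustration, and the input is assumed to have the shape of `lda[corpus]`, i.e. a list of (topic_id, probability) pairs per document.

```python
from collections import defaultdict


def hard_assign(doc_topics):
    """Return the id of the most probable topic for one document.

    doc_topics is a list of (topic_id, probability) pairs, assumed to
    match what lda[corpus] yields for each document.
    """
    return max(doc_topics, key=lambda pair: pair[1])[0]


def build_clusters(topic_dists, documents):
    """Group documents by their single most probable topic."""
    clusters = defaultdict(list)
    for doc_topics, document in zip(topic_dists, documents):
        clusters[hard_assign(doc_topics)].append(document)
    return dict(clusters)


# Toy stand-ins for lda[corpus] and dat (hypothetical data, 3 topics).
toy_dists = [
    [(0, 0.70), (1, 0.20), (2, 0.10)],
    [(0, 0.45), (1, 0.50), (2, 0.05)],
    [(0, 0.10), (1, 0.15), (2, 0.75)],
]
toy_docs = ["doc a", "doc b", "doc c"]

clusters = build_clusters(toy_dists, toy_docs)
print(clusters)
```

With the real model this would be called as `build_clusters(lda_corpus, dat)`. Each document then lands in exactly one cluster, although, as the answer points out, a document such as "doc b" (0.45 vs 0.50) is still genuinely ambiguous and the hard assignment merely hides that.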