Doc2Vec Sentence Clustering

Posted 2019-04-08 21:20

Question:

I have multiple documents, each containing multiple sentences. I want to use doc2vec to get sentence vectors and then cluster them (e.g. with k-means) using sklearn.

The idea is that similar sentences end up grouped together in a few clusters. However, it is not clear to me whether I have to train on every single document separately and then run a clustering algorithm on the sentence vectors, or whether I could infer a sentence vector from doc2vec without training on every new sentence.

Right now this is a snippet of my code:

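from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.cluster import KMeans
import pandas as pd

# Tag every sentence so doc2vec learns one vector per sentence
sentenceLabeled = []
for sentenceID, sentence in enumerate(example_sentences):
    sentenceL = TaggedDocument(words=sentence.split(), tags=['SENT_%s' % sentenceID])
    sentenceLabeled.append(sentenceL)

# gensim >= 4.0 names: vector_size (formerly size), model.dv (formerly model.docvecs)
model = Doc2Vec(vector_size=300, window=10, min_count=0, workers=11, alpha=0.025,
                min_alpha=0.0001, epochs=20)
model.build_vocab(sentenceLabeled)
# Call train() once; it decays the learning rate from alpha to min_alpha across the epochs
model.train(sentenceLabeled, total_examples=model.corpus_count, epochs=model.epochs)
textVect = model.dv.vectors  # one row per tagged sentence, in tagging order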

## K-means ##
num_clusters = 3
km = KMeans(n_clusters=num_clusters)
km.fit(textVect)
clusters = km.labels_.tolist()

## Print Sentence Clusters ##
cluster_info = {'sentence': example_sentences, 'cluster': clusters}
sentenceDF = pd.DataFrame(cluster_info, columns=['sentence', 'cluster'])

for num in range(num_clusters):
    print()
    print("Sentence cluster %d:" % (num + 1))
    # Boolean mask on the 'cluster' column replaces the removed .ix indexer
    for sentence in sentenceDF.loc[sentenceDF['cluster'] == num, 'sentence']:
        print(' %s ' % sentence)
    print()

Basically, what I am doing right now is training on every labeled sentence in the document. However, I have the feeling that this could be done in a simpler way.

Eventually, the sentences that contain similar words should be clustered together and printed. At this point, training on every document separately does not reveal any clear logic within the clusters.

Hopefully someone can steer me in the right direction. Thanks.

Answer 1:

  • Have you looked at the word vectors you get (use the dm=1 algorithm setting)? Do they show sensible similarities when you inspect them?
  • Once you have reasonable-looking word vectors, I would try t-SNE to reduce the dimensionality. You can run PCA first to get down to 50 or so dimensions if you need to; I think both are in sklearn. Then see whether your documents form distinct groups.
  • Also look at most_similar() for your document vectors, and try infer_vector() on a sentence the model was trained on: you should get a similarity very close to 1 if all is well. (infer_vector() gives a slightly different result each time, so it is never identical!)
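
The PCA-then-t-SNE check could look roughly like the sketch below. It is only an illustration: the helper name `project_2d` is made up, and the random blobs stand in for real doc2vec vectors (with a trained gensim 4 model you would pass `model.dv.vectors` instead).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE


def project_2d(vectors, pca_dims=50, perplexity=5.0, seed=0):
    """Reduce vectors with PCA first, then embed them in 2-D with t-SNE."""
    vectors = np.asarray(vectors, dtype=np.float64)
    n_samples, n_features = vectors.shape
    # PCA cannot return more components than min(n_samples, n_features)
    n_components = min(pca_dims, n_samples, n_features)
    reduced = PCA(n_components=n_components, random_state=seed).fit_transform(vectors)
    # t-SNE needs perplexity strictly below the number of samples
    perplexity = min(perplexity, n_samples - 1)
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=seed)
    return tsne.fit_transform(reduced)


# Stand-in data (an assumption, not real sentence vectors):
# three groups of 20 vectors in 300 dimensions, each around its own centre.
rng = np.random.default_rng(0)
centres = rng.normal(scale=5.0, size=(3, 300))
fake_vectors = np.vstack([c + rng.normal(size=(20, 300)) for c in centres])

coords = project_2d(fake_vectors)
print(coords.shape)
```

Plotting `coords` as a scatter plot (for example with matplotlib) shows whether the vectors form separate groups before you commit to a number of k-means clusters.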


Answer 2:

For your use case, all sentences in all documents should be trained together. Essentially, you should treat each sentence as a mini-document. Then they all share the same vocabulary and semantic space.