I am a newbie in text mining; here is my situation. Suppose I have a list of words ['car', 'dog', 'puppy', 'vehicle'] and I would like to cluster them into k groups, with output like [['car', 'vehicle'], ['dog', 'puppy']]. I first calculate the similarity score of each pairwise word to obtain a 4x4 (in this case) matrix M, where Mij is the similarity score of words i and j. After transforming the words into numeric data this way, I use a clustering library (such as sklearn), or implement the clustering myself, to get the word clusters.
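Concretely, my first step looks something like this (the `similarity` function here is just a stand-in for whatever scoring I end up using):

```python
import numpy as np

words = ['car', 'dog', 'puppy', 'vehicle']

def similarity(w1, w2):
    # Stand-in scorer (character-bigram overlap); I'd swap in a real
    # semantic similarity measure here.
    a = {w1[i:i + 2] for i in range(len(w1) - 1)}
    b = {w2[i:i + 2] for i in range(len(w2) - 1)}
    return len(a & b) / len(a | b)

# M[i][j] = similarity score of word i and word j
M = np.array([[similarity(w1, w2) for w2 in words] for w1 in words])
print(M.shape)  # (4, 4)
```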
I want to know whether this approach makes sense. Also, how do I determine the value of k? More importantly, I know that there exist different clustering techniques; should I use k-means or k-medoids for word clustering?
If you want to cluster words by their "semantic similarity" (i.e. likeness of their meaning), take a look at Word2Vec and GloVe. Gensim has an implementation of Word2Vec, and Radim Rehurek's "Word2Vec Tutorial" walks through using it to find similar words.
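For instance, here is a minimal sketch with Gensim 4.x (the two toy sentences are only for illustration; in practice you would train on a large corpus or load pretrained vectors):

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences.
sentences = [
    ['the', 'car', 'is', 'a', 'vehicle'],
    ['the', 'dog', 'is', 'a', 'puppy'],
]

# vector_size is the Gensim 4.x parameter name (it was `size` in 3.x).
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)

print(model.wv.similarity('car', 'vehicle'))  # cosine similarity of two words
print(model.wv.most_similar('dog', topn=3))   # nearest neighbours of 'dog'
```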
Adding on to what's already been said regarding similarity scores, finding `k` in clustering applications is generally aided by scree plots (also known as "elbow curves"). In these plots, you'll usually have some measure of within-cluster dispersion on the y-axis and the number of clusters on the x-axis. The "elbow" of the curve, i.e. the point of maximum curvature where adding another cluster stops reducing the dispersion much, gives you a more objective measure of cluster "uniqueness."
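As a minimal sketch with scikit-learn's `KMeans` (assuming your word vectors are already stacked into a matrix `X`; the random matrix below is just a stand-in), you can trace out that curve like this:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Stand-in for a real (n_words, n_dims) matrix of word vectors.
X = np.random.rand(20, 100)

# Within-cluster sum of squares (inertia) for each candidate k.
ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10).fit(X).inertia_ for k in ks]

plt.plot(ks, inertias, marker='o')
plt.xlabel('number of clusters k')
plt.ylabel('within-cluster sum of squares')
plt.show()  # pick k at the "elbow" of this curve
```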
Following up the answer by Brian O'Donnell, once you've computed the semantic similarities with word2vec (or FastText or GloVe, ...), you can then cluster the matrix using `sklearn.cluster`. I've found that for small matrices, spectral clustering gives the best results.

It's worth keeping in mind that the word vectors are often embedded on a high-dimensional sphere. K-means with a Euclidean distance metric fails to capture this, and may lead to poor results for the similarity of words that aren't immediate neighbors.
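Here is a minimal sketch of that combination, using cosine similarity as a precomputed affinity so the clustering respects direction on the sphere rather than Euclidean distance (the random vectors are a stand-in for real embeddings):

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics.pairwise import cosine_similarity

words = ['car', 'dog', 'puppy', 'vehicle']
vectors = np.random.rand(4, 100)  # stand-in for real word embeddings

# Cosine similarity lies in [-1, 1]; shift it into [0, 1] so the
# affinity matrix passed to spectral clustering is non-negative.
affinity = (cosine_similarity(vectors) + 1) / 2

labels = SpectralClustering(n_clusters=2, affinity='precomputed').fit_predict(affinity)

clusters = [[w for w, l in zip(words, labels) if l == k] for k in range(2)]
print(clusters)  # with real embeddings, e.g. [['car', 'vehicle'], ['dog', 'puppy']]
```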