Clustering: how to extract most distinguishing fea

I have a set of documents that I am trying to cluster based on their vocabulary (that is, first making a corpus and then a sparse matrix with the DocumentTermMatrix command and so on). To improve the clusters and to understand better what features/words make a particular document fall into a particular cluster, I would like to know what the most distinguishing features for each cluster are.

There is an example of this in the Machine Learning with R book by Lantz, if you happen to know it - he clusters teen social media profiles by the interests they have pegged, and ends up with a table like this that shows "each cluster ... with the features that most distinguish it from the other clusters":

cluster 1  | cluster 2 | cluster 3 ....
swimming   | band      | sports  ... 
dance      | music     | kissed ....

Now, my features aren't quite as informative, but I'd still like to be able to build something like that.
However, the book does not explain how the table was constructed. I have tried my best to google creatively, and perhaps the answer is some obvious calculation on the cluster means, but being a newbie to R as well as to statistics, I could not figure it out. Any help is much appreciated, including links to previous questions or other resources I may have missed!

Thanks.

标签： r cluster-analysis text-mining

1条回答

在下西门庆

2楼-- · 2019-02-21 05:10

I had a similar problem some time ago..

Here is what I did:

require("tm")
require("skmeans")
require("slam")

# clus: a skmeans object
# dtm: a Document Term Matrix
# first: eg. 10 most frequent words per cluster
# unique: if FALSE all words of the DTM will be used
#         if TRUE only cluster specific words will be used 



# result: List with words and frequency of words 
#         If unique = TRUE, only cluster specific words will be considered.
#         Words which occur in more than one cluster will be ignored.



mfrq_words_per_cluster <- function(clus, dtm, first = 10, unique = TRUE){
  if(!any(class(clus) == "skmeans")) return("clus must be an skmeans object")

  dtm <- as.simple_triplet_matrix(dtm)
  indM <- table(names(clus$cluster), clus$cluster) == 1 # generate bool matrix

  hfun <- function(ind, dtm){ # help function, summing up words
    if(is.null(dtm[ind, ]))  dtm[ind, ] else  col_sums(dtm[ind, ])
  }
  frqM <- apply(indM, 2, hfun, dtm = dtm)

  if(unique){
    # eliminate word which occur in several clusters
    frqM <- frqM[rowSums(frqM > 0) == 1, ] 
  }
  # export to list, order and take first x elements 
  res <- lapply(1:ncol(frqM), function(i, mat, first)
                head(sort(mat[, i], decreasing = TRUE), first),
                mat = frqM, first = first)

  names(res) <- paste0("CLUSTER_", 1:ncol(frqM))
  return(res)
}

A small example:

data("crude")
dtm <- DocumentTermMatrix(crude, control =
                          list(removePunctuation = TRUE,
                               removeNumbers = TRUE,
                               stopwords = TRUE))

rownames(dtm) <- paste0("Doc_", 1:20)
clus <- skmeans(dtm, 3)


mfrq_words_per_cluster(clus, dtm)
mfrq_words_per_cluster(clus, dtm, unique = FALSE)

HTH

0人赞添加讨论(0) 举报

Clustering: how to extract most distinguishing fea

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间