I have a set of documents that I am trying to cluster based on their vocabulary (that is, I first build a corpus and then a sparse matrix with the DocumentTermMatrix command, and so on). To improve the clusters and to better understand which features/words make a document fall into a particular cluster, I would like to know what the most distinguishing features of each cluster are.
There is an example of this in the Machine Learning with R book by Lantz, if you happen to know it - he clusters teen social media profiles by the interests they have pegged, and ends up with a table like this that shows "each cluster ... with the features that most distinguish it from the other clusters":
cluster 1 | cluster 2 | cluster 3 | ...
swimming  | band      | sports    | ...
dance     | music     | kissed    | ...
Now, my features aren't quite as informative, but I'd still like to be able to build something like that.
However, the book does not explain how the table was constructed. I have tried my best to google creatively, and perhaps the answer is some obvious calculation on the cluster means, but being a newbie to R as well as to statistics, I could not figure it out. Any help is much appreciated, including links to previous questions or other resources I may have missed!
Thanks.
I had a similar problem some time ago.
Here is what I did:
A small example:
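The snippet below is only a minimal sketch of one common way to build such a table, not necessarily the method used in the book or in this answer. It assumes a dense documents-by-terms matrix (for a tm DocumentTermMatrix, `as.matrix(dtm)` should give one) and at least two clusters. The helper name `top_terms_by_cluster`, the toy counts, and the scoring rule (mean term frequency inside the cluster minus mean frequency in all other documents) are assumptions made for illustration.

```r
# Hypothetical helper: for each cluster, rank terms by how much their mean
# frequency inside the cluster exceeds their mean frequency everywhere else.
top_terms_by_cluster <- function(m, cluster, n = 2) {
  m <- as.matrix(m)
  ids <- sort(unique(cluster))
  top <- sapply(ids, function(k) {
    in_k <- cluster == k
    score <- colMeans(m[in_k, , drop = FALSE]) -
      colMeans(m[!in_k, , drop = FALSE])
    names(sort(score, decreasing = TRUE))[seq_len(n)]
  })
  top <- matrix(top, nrow = n)  # keep a matrix even when n == 1
  colnames(top) <- paste("cluster", ids)
  top
}

# Toy documents-by-terms counts (made up, not the data from the book).
dtm <- matrix(c(
  3, 2, 0, 0, 0, 0,
  2, 3, 0, 1, 0, 0,
  0, 0, 3, 1, 0, 0,
  0, 1, 2, 3, 0, 0,
  0, 0, 0, 0, 3, 1,
  0, 0, 1, 0, 2, 3
), nrow = 6, byrow = TRUE,
dimnames = list(
  paste0("doc", 1:6),
  c("swimming", "dance", "band", "music", "sports", "kissed")
))

# Hierarchical clustering keeps the toy example deterministic; the cluster
# vector from kmeans(dtm, 3)$cluster could be passed in the same way.
cl <- cutree(hclust(dist(dtm)), k = 3)

tab <- top_terms_by_cluster(dtm, cl, n = 2)
print(tab)
```

On this toy matrix the three columns come out as swimming/dance, band/music, and sports/kissed. With real text data it may be worth scoring on tf-idf weights or on standardized columns instead of raw counts, since very frequent words otherwise tend to dominate every cluster.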
HTH