K-means clustering using a Jaccard distance matrix

Published 2019-07-28 13:36

Question:

I'm trying to create a Jaccard distance matrix and run k-means on it to output cluster ids and the ids of the elements in each cluster. The input is Twitter tweets. The following is my code; I couldn't understand how to use initial seeds from a file for kmeans.

# jsonlite provides fromJSON; stringdist provides stringdistmatrix
install.packages("jsonlite", dependencies = TRUE)
library("jsonlite")

install.packages("stringdist", dependencies = TRUE)
library("stringdist")

# Read one JSON object per line and wrap them into a single JSON array
data <- fromJSON(sprintf("[%s]", paste(readLines("C:\\Users\\Yuzuru Onathoshi\\Desktop\\Assignment5_pxv142730_sxl162530\\Part2\\Input\\Tweets.json"), collapse = ",")))

t.feature <- data
t.feature$geo <- NULL  # drop the geo column

# Pairwise Jaccard distances between tweet texts
Jmatrix <- stringdistmatrix(t.feature$text, t.feature$text, method = "jaccard")
colnames(Jmatrix) <- t.feature$from_user_id
rownames(Jmatrix) <- t.feature$from_user_id

fit <- kmeans(Jmatrix, 10)
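As for the initial-seeds part of the question: base R's `kmeans` accepts a matrix of starting centers through its `centers` argument in place of a cluster count, so seeds read from a file (e.g. with `read.table`) can be passed straight in. A minimal sketch on toy numeric data; the file-reading step is commented out and the file name is hypothetical:

```r
set.seed(1)
x <- matrix(rnorm(60), ncol = 2)   # stand-in for the rows of Jmatrix

# Pretend these two rows were loaded from a seeds file, e.g.
# seeds <- as.matrix(read.table("seeds.txt"))  # hypothetical file name
seeds <- x[c(1, 30), ]

# When `centers` is a matrix, its rows are the initial centroids and the
# number of clusters is nrow(seeds); no random initialization is used.
fit <- kmeans(x, centers = seeds)
fit$cluster
```

The seeds matrix must have the same number of columns as the data and distinct rows, otherwise `kmeans` stops with an error.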

Answer 1:

k-means does not use a distance matrix.

This is easy to see: it does not operate on pairwise distances; it only needs the deviation of each point from a center (which will usually not be a point of your data set).

It expects continuous numerical input data for clustering, and does not support arbitrary distance functions.

The core idea of k-means is minimizing variance (which is the same as minimizing squared Euclidean distances). Contrary to some tutorials and even textbooks, k-means does not in fact minimize distance (it minimizes squared distance if your distance is Euclidean, but that may be a different minimum than the least-distance minimum). If you want k-means to minimize another distance, you have to find an appropriate "mean", i.e. a function that estimates the least-distance center point. Some generic substitutes have been proposed, e.g. PAM (k-medoids).
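PAM sidesteps the mean problem by picking actual data points (medoids) as centers, so it can consume a precomputed dissimilarity matrix directly. A minimal sketch with the `cluster` package (shipped with most R distributions); the tiny character-set Jaccard function here is just a self-contained stand-in for `stringdistmatrix(..., method = "jaccard")`:

```r
library(cluster)

# Toy Jaccard distance on sets of characters (stand-in for stringdist output)
jaccard <- function(a, b) {
  A <- unique(strsplit(a, "")[[1]])
  B <- unique(strsplit(b, "")[[1]])
  1 - length(intersect(A, B)) / length(union(A, B))
}

texts <- c("apple pie", "apple tart", "banana split", "banana bread")
d <- outer(texts, texts, Vectorize(jaccard))

# pam() takes the dissimilarities as-is (diss = TRUE), so any metric works
fit <- pam(as.dist(d), k = 2, diss = TRUE)
fit$clustering
```

Because the medoids are real data points, no "mean" under the Jaccard distance ever has to be computed.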

If you put a Jaccard distance matrix into k-means, it will often yield a somewhat usable result, but it's not what you would expect: rather than comparing points by Jaccard distance, you cluster them by the squared Euclidean distance of their distance vectors. It's easy to see that this value is 0 exactly when two points have the same Jaccard distances to all others (including themselves), so in particular their Jaccard distance to each other must be 0. But if your data set is unbalanced (some clusters contain very many objects), those clusters will also carry too much weight in this dual space.

If you need other distances (and can afford to compute a full distance matrix), use hierarchical agglomerative clustering (HAC) rather than k-means!
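HAC in base R takes a `dist` object directly, so a Jaccard matrix like the one in the question can be fed to `hclust` and cut into k groups. A minimal sketch on a toy symmetric matrix standing in for `Jmatrix`:

```r
# Toy distance matrix: t1/t2 are close (0.2), t3/t4 are close (0.1)
d <- matrix(c(0.0, 0.2, 0.8, 0.9,
              0.2, 0.0, 0.7, 0.8,
              0.8, 0.7, 0.0, 0.1,
              0.9, 0.8, 0.1, 0.0), nrow = 4)
rownames(d) <- colnames(d) <- c("t1", "t2", "t3", "t4")

# Average-linkage HAC on the precomputed distances
fit <- hclust(as.dist(d), method = "average")
clusters <- cutree(fit, k = 2)
clusters  # t1/t2 end up in one cluster, t3/t4 in the other
```

With row names set to user ids, as in the question, `cutree` returns a named vector mapping each id to its cluster, which is exactly the "cluster ids and element ids" output the question asks for.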