R - cluster analysis on binary weblog data

Posted 2019-05-26 18:19

Question:

I have web data that looks similar to the sample below. It simply has the user and a binary value for whether that user clicked on a particular link within a website. I want to cluster this data. My main goal is to find similar users based on their online behaviour. What is a good clustering algorithm for this? I have tried k-means, which does not work well with binary data. I have also tried spherical k-means with skmeans(). I wanted to draw a sum of squared errors (SSE) scree plot, but I could not figure out how to get the SSE from skmeans.

   User   link1 link2 link3 link4
    abc1     0     1     1     1
    abc2     1     0     1     0
    abc3     0     1     1     1
    abc4     1     0     1     0

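Answer 1:

You could try hierarchical clustering with a binary distance measure such as Jaccard, which suits the data if "clicked a link" is asymmetric, that is, if two users both clicking a link says more about their similarity than both not clicking it: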

dat <- read.table(header = TRUE, row.names = 1, text = "User   link1 link2 link3 link4
abc1     0     1     1     1
abc2     1     0     1     0
abc3     0     1     1     1
abc4     1     0     1     0")
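d <- dist(dat, method = "binary")  # Jaccard distance: links neither user clicked are ignored
hc <- hclust(d)                    # hierarchical clustering (complete linkage by default)
plot(hc)                           # dendrogram of the users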

(clusters <- cutree(hc, k = 2))
# abc1 abc2 abc3 abc4 
#    1    2    1    2
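
On the scree plot from the question: the hierarchical approach does not report an SSE either, but a rough analogue can be computed from the distance matrix. The sketch below is only one possible way to do it. `within_ss` is a hypothetical helper, not part of any package, and it sums squared Jaccard distances within each cluster, scaled by cluster size. With Euclidean distances this quantity equals the usual SSE around the cluster means; with Jaccard distances it is only a heuristic stand-in.

```r
# Same toy data as in the answer above
dat <- read.table(header = TRUE, row.names = 1, text = "User link1 link2 link3 link4
abc1 0 1 1 1
abc2 1 0 1 0
abc3 0 1 1 1
abc4 1 0 1 0")

d  <- dist(dat, method = "binary")  # Jaccard distance between users
hc <- hclust(d)

# Hypothetical helper (an assumption, not a library function):
# for each cluster, sum the squared pairwise distances and divide by the
# cluster size. as.matrix(d) is symmetric, so every pair appears twice,
# hence the factor 2 in the denominator.
within_ss <- function(d, labels) {
  m <- as.matrix(d)^2
  sum(vapply(unique(labels), function(g) {
    idx <- labels == g
    sum(m[idx, idx, drop = FALSE]) / (2 * sum(idx))
  }, numeric(1)))
}

# Evaluate the criterion for each candidate number of clusters
ks  <- 1:(nrow(dat) - 1)
wss <- vapply(ks, function(k) within_ss(d, cutree(hc, k = k)), numeric(1))

# Scree-style plot: look for the "elbow" where the curve flattens
plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Within-cluster sum of squared distances")
```

On this toy data the curve drops to zero at k = 2, because the two pairs of users have identical click patterns. On real data the elbow is usually less clear-cut, so it may be worth comparing it against the dendrogram.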