Clustering for Categorical and Numerical data

2019-07-18 06:25发布

I have a collection of alerts and I want to group it based on similarity/distance. As we have non-numeric data, How can i perform clustering for this kind of data.

  set.seed(42)   
  data.frame(Host1 = rep("del",10), 
  Host2 = c(rep("cpp",4), rep("sscp",3), rep("portal",3)),
 Host3 = c(rep("web",5), rep("apache",3), rep("app",2)), 
 Host4 = c(sample(3,8, replace = TRUE), rep("con",2)), 
 Date1 = abs(round(1:10 + rnorm(10),2))) 



   Host1  Host2  Host3 Host4 Date1
1    del    cpp    web     3  1.40
2    del    cpp    web     3  1.89
3    del    cpp    web     1  4.51
4    del    cpp    web     3  3.91
5    del   sscp    web     2  7.02
6    del   sscp apache     2  5.94
7    del   sscp apache     3  8.30
8    del portal apache     1 10.29
9    del portal    app   con  7.61
10   del portal    app   con  9.72

Looking forward to build clusters.

1条回答
仙女界的扛把子
2楼-- · 2019-07-18 07:16

K-means only works for numerical (continuous) data

By definition, it minimizes squared deviations. Minimizing squared deviations only make sense on continuous data. Any kind of one-hot-encoding is only a hack; it makes the data types compatible, but not the approach sensible.

What is your similarity / distance?

Hierarchical clustering would work. If you can define a meaningful distance function that quantifies distance. But this is application dependant. We do not have your data, and do not understand your problem. We cannot solve this for you.

查看更多
登录 后发表回答