I have a collection of alerts and I want to group them based on similarity/distance. Since the data is non-numeric, how can I perform clustering on this kind of data?
set.seed(42)
data.frame(Host1 = rep("del", 10),
           Host2 = c(rep("cpp", 4), rep("sscp", 3), rep("portal", 3)),
           Host3 = c(rep("web", 5), rep("apache", 3), rep("app", 2)),
           Host4 = c(sample(3, 8, replace = TRUE), rep("con", 2)),
           Date1 = abs(round(1:10 + rnorm(10), 2)))
   Host1  Host2  Host3 Host4 Date1
1    del    cpp    web     3  1.40
2    del    cpp    web     3  1.89
3    del    cpp    web     1  4.51
4    del    cpp    web     3  3.91
5    del   sscp    web     2  7.02
6    del   sscp apache     2  5.94
7    del   sscp apache     3  8.30
8    del portal apache     1 10.29
9    del portal    app   con  7.61
10   del portal    app   con  9.72
I would like to build clusters from this data.
K-means only works for numerical (continuous) data.
By definition, it minimizes squared deviations from the cluster means. Minimizing squared deviations only makes sense on continuous data. Any kind of one-hot encoding is only a hack; it makes the data types compatible, but it does not make the approach sensible.
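To illustrate why one-hot encoding does not rescue k-means, here is a minimal sketch on a hypothetical toy column (the values are made up for illustration, they are not your alerts). It shows that the "mean" k-means would compute is a vector of fractions that corresponds to no actual category, and that every pair of distinct categories ends up equally far apart.

```r
# Hypothetical toy column, chosen only for illustration.
host <- factor(c("cpp", "cpp", "sscp", "portal"))

# One-hot encode: one 0/1 indicator column per level (no intercept).
onehot <- model.matrix(~ host - 1)

# The center k-means would use for this group is the column mean.
# Here the means are 0.5 (cpp), 0.25 (portal), 0.25 (sscp),
# which is not a valid host value.
center <- colMeans(onehot)
print(center)

# Squared Euclidean distances between the encoded rows:
# 0 for identical categories, 2 for any two different ones.
sq_dist <- dist(onehot)^2
print(sq_dist)
```

So the encoding carries no information beyond "same" or "different", and the squared-deviation objective is averaging indicator columns rather than anything meaningful about the hosts.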
What is your similarity / distance?
Hierarchical clustering would work, provided you can define a meaningful distance function that quantifies how dissimilar two alerts are. But this is application dependent. We do not have your data and do not understand your problem, so we cannot solve this for you.
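As a starting point only, here is a sketch of what that could look like in R. It assumes Gower dissimilarity (via daisy() from the cluster package) is acceptable for your alerts, which is my assumption and not something the data can confirm: Gower treats each categorical column as a plain match/mismatch and range-scales the numeric column. The average linkage and the choice of k = 3 are likewise arbitrary placeholders.

```r
library(cluster)  # recommended package shipped with R; provides daisy()

set.seed(42)
# Same construction as in the question. Host1 is constant, so it carries
# no information and is left out here. daisy() needs factors rather than
# character columns, hence stringsAsFactors = TRUE.
alerts <- data.frame(
  Host2 = c(rep("cpp", 4), rep("sscp", 3), rep("portal", 3)),
  Host3 = c(rep("web", 5), rep("apache", 3), rep("app", 2)),
  Host4 = c(sample(3, 8, replace = TRUE), rep("con", 2)),
  Date1 = abs(round(1:10 + rnorm(10), 2)),
  stringsAsFactors = TRUE
)

# Assumption: Gower dissimilarity is a reasonable notion of distance here.
d <- daisy(alerts, metric = "gower")

# Hierarchical clustering on that dissimilarity (average linkage is an
# arbitrary choice for this sketch).
hc <- hclust(as.dist(d), method = "average")

# Cut the tree into a hypothetical number of clusters.
groups <- cutree(hc, k = 3)
print(table(groups))
```

plot(hc) draws the dendrogram, which is usually a better guide to where to cut than a fixed k. If some columns matter more than others for your alerts, daisy() accepts a weights argument, but only you can say what the right weights are.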