I want to run k-means clustering on a dataset (Sample_Data) with three variables (columns), like the one below:
    A   B   C
1  12  10   1
2   8  11   2
3  14  10   1
.   .   .   .
.   .   .   .
.   .   .   .
Typically, after scaling the columns and determining the number of clusters, I would use this in R:
Sample_Data <- scale(Sample_Data)
output_kmeans <- kmeans(Sample_Data, centers = 5, nstart = 50)
But what if some variables matter more than others? Suppose variable (column) A is more important than the other two.
How can I include the variable weights in the model?
Thank you all
You can use a weighted k-means clustering, such as the one provided in the flexclust
package:
https://cran.r-project.org/web/packages/flexclust/flexclust.pdf
The function
cclust(x, k, dist = "euclidean", method = "kmeans",
weights=NULL, control=NULL, group=NULL, simple=FALSE,
save.data=FALSE)
Perform k-means clustering, hard competitive learning or neural gas on a data matrix.
weights
An optional vector of weights to be used in the fitting process. Works only in combination with hard competitive learning.
A toy example using iris data:
library(flexclust)
data(iris)
cl <- cclust(iris[,-5], k=3, save.data=TRUE,weights =c(1,0.5,1,0.1),method="hardcl")
cl
kcca object of family ‘kmeans’
call:
cclust(x = iris[, -5], k = 3, method = "hardcl", weights = c(1, 0.5, 1, 0.1), save.data = TRUE)
cluster sizes:
1 2 3
50 59 41
As you can see from the output of cclust, the family is still reported as kmeans even when hard competitive learning is used.
The difference lies in how the cluster centers are updated during the training phase:
If method is "kmeans", the classic kmeans algorithm as given by
MacQueen (1967) is used, which works by repeatedly moving all cluster
centers to the mean of their respective Voronoi sets. If "hardcl",
on-line updates are used (AKA hard competitive learning), which work
by randomly drawing an observation from x and moving the closest
center towards that point (e.g., Ripley 1996).
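To make the quoted distinction concrete, here is a minimal base-R sketch of one batch step next to one on-line step. It is illustrative only and is not the flexclust implementation; the helper names (nearest, batch_step, online_step) and the learning rate of 0.1 are assumptions introduced for this example.

```r
# One batch k-means step vs. one on-line (hard competitive learning) step.
# Illustrative sketch only; not the flexclust implementation.
set.seed(1)
x <- as.matrix(iris[, 1:4])
centers <- x[sample(nrow(x), 3), , drop = FALSE]   # random starting centers

# Index of the closest center for a single observation.
nearest <- function(obs, centers) {
  which.min(colSums((t(centers) - obs)^2))
}

# Batch step: assign every point, then move each center to the mean of its set.
batch_step <- function(x, centers) {
  cl <- apply(x, 1, nearest, centers = centers)
  for (k in seq_len(nrow(centers))) {
    if (any(cl == k)) centers[k, ] <- colMeans(x[cl == k, , drop = FALSE])
  }
  centers
}

# On-line step: draw one observation at random and pull only the closest
# center towards it by a (hypothetical) learning rate.
online_step <- function(x, centers, rate = 0.1) {
  obs <- x[sample(nrow(x), 1), ]
  k <- nearest(obs, centers)
  centers[k, ] <- centers[k, ] + rate * (obs - centers[k, ])
  centers
}

centers_batch  <- batch_step(x, centers)
centers_online <- online_step(x, centers)
```

A batch step can move every center at once, whereas an on-line step moves at most one center per drawn observation.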
The weights parameter is simply a numeric vector; I generally use values between 0.01 (minimum weight) and 1 (maximum weight).
If you want to increase the weight of a variable (column), just multiply it by a constant c > 1 after scaling.
Since k-means minimizes the sum of squared Euclidean distances (SSQ), this multiplies that variable's contribution to the objective by c^2, so its weight increases.
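A minimal sketch of this column-scaling approach, assuming iris as stand-in data and a hypothetical weight vector var_weights (neither comes from the question). Because the contribution grows with the square of the multiplier, a variable that should carry weight w in the objective is multiplied by sqrt(w):

```r
# Variable-weighted k-means via column scaling (illustrative sketch).
set.seed(42)

x <- scale(iris[, 1:4])                  # standardise first, as in the question

# Hypothetical weights: the first column counts 4x as much as the others
# in the sum-of-squares objective.
var_weights <- c(4, 1, 1, 1)

# Multiply each column by sqrt(weight); squared distances then carry the weight.
x_weighted <- sweep(x, 2, sqrt(var_weights), "*")

fit <- kmeans(x_weighted, centers = 3, nstart = 50)

# Cluster centers expressed in the original scaled units (undo the multiplier).
centers_scaled <- sweep(fit$centers, 2, sqrt(var_weights), "/")
table(fit$cluster)
```

The scaling must be applied after scale(), otherwise standardisation would cancel the multiplier.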
I had the same problem, and the answer here did not satisfy me.
What we both wanted was an observation-weighted k-means clustering in R. A good, readable example of the problem is at this link: https://towardsdatascience.com/clustering-the-us-population-observation-weighted-k-means-f4d58b370002
However, the flexclust solution is not satisfying, simply because the algorithm it uses with weights is not the "standard" k-means algorithm but the "hard competitive learning" algorithm. The differences are well described above and in the package description.
I looked through many sites and did not find any solution or package in R that performs a "standard" k-means algorithm with weighted observations. I was also wondering why the flexclust package explicitly does not support weights with the standard k-means algorithm. If anyone has an explanation for this, please feel free to share!
So basically you have two options. First, rewrite the flexclust algorithm to enable weights within the standard approach. Second, estimate weighted cluster centroids as starting centroids and run a standard k-means with only one iteration, then compute new weighted cluster centroids from the resulting assignment and run k-means with one iteration again, and so on until the assignment no longer changes.
I used the second alternative because it was the easier way for me. I used the data.table package; I hope you are familiar with it.
library(data.table)
### gen dataset with sample-weights
dataset <- data.table(iris)
dataset[, weights:= rep(c(1, 0.7, 0.3, 4, 5),30)]
dataset[, Species := NULL]
### initial hclust for estimating weighted centroids
clustering <- hclust(dist(dataset[, c(1:4)], method = 'euclidean'),
method = 'ward.D2')
no_of_clusters <- 4
### estimating starting centroids (weighted)
weighted_centroids <- matrix(NA, nrow = no_of_clusters,
ncol = ncol(dataset[, c(1:4)]))
for (i in seq_len(no_of_clusters)) {
  # rows assigned to cluster i by the initial hierarchical clustering
  in_cluster <- cutree(clustering, k = no_of_clusters) == i
  weighted_centroids[i, ] <- sapply(dataset[, c(1:4)][in_cluster, ], weighted.mean,
                                    w = dataset[in_cluster, weights])
}
### performing weighted k-means as explained in my post
iter <- 0
cluster_i <- 0
cluster_iminus1 <- 1
## while loop: if number of iteration is smaller than 50 and cluster_i (result of
## current iteration) is not identical to cluster_iminus1 (result of former
## iteration) then continue
while (!identical(cluster_i, cluster_iminus1) && iter < 50) {
# update iteration
iter <- iter + 1
# k-means with weighted centroids and one iteration (may generate warning messages
# as no convergence is reached)
cluster_kmeans <- kmeans(x = dataset[, c(1:4)], centers = weighted_centroids, iter.max = 1)$cluster
# estimating new weighted centroids
weighted_centroids <- matrix(NA, nrow = no_of_clusters,
ncol=ncol(dataset[,c(1:4)]))
for (i in seq_len(no_of_clusters)) {
  # use the assignment from this k-means step, not the initial hclust cut;
  # otherwise the centroids never change between iterations
  in_cluster <- cluster_kmeans == i
  weighted_centroids[i, ] <- sapply(dataset[, c(1:4)][in_cluster, ], weighted.mean,
                                    w = dataset[in_cluster, weights])
}
# update cluster_i and cluster_iminus1
cluster_iminus1 <- cluster_i
cluster_i <- cluster_kmeans
}
## merge final clusters to data table
dataset[, cluster := cluster_i]
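For comparison, the same idea (assign points to the nearest centroid, then recompute weighted centroids, until the assignment stops changing) can be sketched in base R without calling kmeans() at all. This is a plain Lloyd-style iteration rather than the Hartigan-Wong algorithm that kmeans() uses by default, so results may differ from the code above. The function name weighted_kmeans and the random starting centers are assumptions made for this sketch.

```r
# Observation-weighted k-means (Lloyd-style), base R only. Illustrative sketch.
weighted_kmeans <- function(x, w, centers, max_iter = 50) {
  x <- as.matrix(x)
  cluster <- integer(nrow(x))
  for (iter in seq_len(max_iter)) {
    # Squared Euclidean distance from every point (rows) to every center (columns).
    d <- sapply(seq_len(nrow(centers)),
                function(k) rowSums(sweep(x, 2, centers[k, ], "-")^2))
    new_cluster <- max.col(-d, ties.method = "first")   # index of nearest center
    if (identical(new_cluster, cluster)) break           # assignment is stable
    cluster <- new_cluster
    # Move each non-empty cluster's center to the weighted mean of its members.
    for (k in seq_len(nrow(centers))) {
      idx <- cluster == k
      if (any(idx)) {
        centers[k, ] <- colSums(x[idx, , drop = FALSE] * w[idx]) / sum(w[idx])
      }
    }
  }
  list(cluster = cluster, centers = centers, iter = iter)
}

set.seed(1)
x <- iris[, 1:4]
w <- rep(c(1, 0.7, 0.3, 4, 5), 30)           # same weights as in the code above
init <- as.matrix(x[sample(nrow(x), 4), ])   # hypothetical random starting centers
fit <- weighted_kmeans(x, w, init)
table(fit$cluster)
```

Empty clusters simply keep their previous center here; a production version would need a more deliberate policy for that case.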