I want to do a k-means clustering on a dataset (namely, Sample_Data) with three variables (columns), as below:
    A  B  C
1  12 10  1
2   8 11  2
3  14 10  1
.   .  .  .
.   .  .  .
.   .  .  .
In the typical case, after scaling the columns and determining the number of clusters, I would use the kmeans function in R:
Sample_Data <- scale(Sample_Data)
output_kmeans <- kmeans(Sample_Data, centers = 5, nstart = 50)
But what if there is a preference among the variables? I mean, suppose variable (column) A is more important than the other two; how can I insert such weights into the model? Thank you all.
I had the same problem, and the existing answer here is not satisfying for me. What we both wanted was an observation-weighted k-means clustering in R. A good, readable example for our question is this link: https://towardsdatascience.com/clustering-the-us-population-observation-weighted-k-means-f4d58b370002
However, the solution of using the flexclust package is not satisfying, simply because the algorithm used is not the "standard" k-means algorithm but the "hard competitive learning" algorithm. The differences are well described in the other answer and in the package description. I looked through many sites and did not find any solution/package in R for performing a "standard" k-means algorithm with weighted observations. I was also wondering why the flexclust package explicitly does not support weights with the standard k-means algorithm. If anyone has an explanation for this, please feel free to share!
So basically you have two options: first, rewrite the flexclust algorithm to enable weights within the standard approach; or second, estimate weighted cluster centroids as starting centroids, perform a standard k-means algorithm with only one iteration, then compute new weighted cluster centroids, perform a k-means with one iteration again, and so on until you reach convergence.
I used the second alternative because it was the easier way for me. In my own code I used the data.table package (hope you are familiar with it); the idea looks roughly like the sketch below.
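Here is a minimal sketch of that loop in base R rather than data.table, for brevity. The function name weighted_kmeans, the unweighted starting centers, and the convergence tolerance are my own illustrative choices, not code from the original:

weighted_kmeans <- function(x, w, k, max_iter = 100, tol = 1e-8) {
  x <- as.matrix(x)
  stopifnot(length(w) == nrow(x), all(w > 0))
  # start from ordinary (unweighted) k-means centers
  centers <- kmeans(x, centers = k, nstart = 25)$centers
  for (i in seq_len(max_iter)) {
    # assignment step: one iteration of standard (Lloyd) k-means from the
    # current centers; iter.max = 1 triggers a harmless convergence warning
    fit <- suppressWarnings(kmeans(x, centers = centers, iter.max = 1,
                                   algorithm = "Lloyd"))
    cl <- fit$cluster
    # update step: recompute each centroid as the *weighted* mean of its
    # cluster (no empty-cluster handling here, for brevity)
    new_centers <- t(sapply(seq_len(k), function(j) {
      idx <- cl == j
      colSums(x[idx, , drop = FALSE] * w[idx]) / sum(w[idx])
    }))
    converged <- max(abs(new_centers - centers)) < tol
    centers <- new_centers
    if (converged) break
  }
  list(cluster = cl, centers = centers)
}

# e.g. res <- weighted_kmeans(scale(Sample_Data), w = obs_weights, k = 5)
# where obs_weights is a hypothetical vector with one weight per observation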
If you want to increase the weight of a variable (column), just multiply it by a constant c > 1.
It's trivial to show that this increases the weight in the SSQ optimization objective: every squared difference on that column becomes (c·x − c·y)² = c²(x − y)², so the column's contribution to the within-cluster sum of squares is scaled by c².
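A minimal sketch reusing the snippet from the question; the factor sqrt(2) is an arbitrary illustrative choice that gives column A twice the weight:

Sample_Data <- scale(Sample_Data)
# scaling column A by sqrt(2) multiplies its contribution to the
# within-cluster SSQ by (sqrt(2))^2 = 2, i.e. doubles its weight
Sample_Data[, "A"] <- Sample_Data[, "A"] * sqrt(2)
output_kmeans <- kmeans(Sample_Data, centers = 5, nstart = 50)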
You have to use a weighted k-means clustering, like the one presented in the flexclust package: https://cran.r-project.org/web/packages/flexclust/flexclust.pdf
The function cclust() does this:
"Perform k-means clustering, hard competitive learning or neural gas on a data matrix."
Its weights argument is documented as:
"weights: An optional vector of weights to be used in the fitting process. Works only in combination with hard competitive learning."
A toy example using iris data:
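The sketch below uses only cclust and its documented arguments; the specific weights, seed, and k = 3 are my own illustrative choices (weights only take effect with method = "hardcl"):

library(flexclust)
data(iris)
set.seed(1)
# illustrative weights: maximum weight for the first 75 observations,
# minimum weight for the rest
w <- rep(c(1, 0.01), each = 75)
fit <- cclust(iris[, 1:4], k = 3, dist = "euclidean",
              method = "hardcl", weights = w)
fit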
As you can see from the output of cclust, even when using competitive learning the family is always kmeans. The difference lies in how cluster assignment is done during the training phase.
The weights parameter is just a sequence of numbers; in general I use values between 0.01 (minimum weight) and 1 (maximum weight).