Relabel samples in kmean results considering the o

Posted 2019-07-11 07:36

Question:

I am using kmeans to cluster my data, and I have a plan for the labels it produces.

I want to relabel the samples according to the ordered centres. Consider the following example:

a = c("a","b","c","d","e","F","i","j","k","l","m","n")
b = c(1,2,3,20,21,21,40,41,42,4,23,50)

mydata = data.frame(id=a,amount=b)
result = kmeans(mydata$amount,3,nstart=10)

Here is the result :

result$cluster
2 2 2 3 3 3 1 1 1 2 3 1

result$centers
1 43.25
2  2.50
3 21.25

mydata = data.frame(mydata, label = result$cluster)
mydata
    id amount  label
1   a      1        2
2   b      2        2
3   c      3        2
4   d     20        3
5   e     21        3
6   F     21        3
7   i     40        1
8   j     41        1
9   k     42        1
10  l      4        2
11  m     23        3
12  n     50        1

What I am looking for is sorting the centres and producing the labels accordingly:

1  2.50
2  21.25
3  43.25

so that the sample labels become:

1 1 1 2 2 2 3 3 3 1 2 3 

and the result should be:

    id amount  label
1   a      1        1
2   b      2        1
3   c      3        1
4   d     20        2
5   e     21        2
6   F     21        2
7   i     40        3
8   j     41        3
9   k     42        3
10  l      4        1
11  m     23        2
12  n     50        3

I think it is possible to do this by ordering the centres and then, for each sample, taking the index of the nearest ordered centre as that sample's label.
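The distance-based idea just described can be sketched roughly as follows. This is only an illustration: the centres are typed in from the output printed above rather than taken from a fresh kmeans run, and the names centres, sorted_centres and new_label are made up for the example.

```r
# Data from the question
amount <- c(1, 2, 3, 20, 21, 21, 40, 41, 42, 4, 23, 50)

# Assumption: centres copied from the printed result$centers above
centres <- c(43.25, 2.50, 21.25)

# Sort the centres in increasing order
sorted_centres <- sort(centres)

# For each sample, the new label is the index of the nearest sorted centre
new_label <- vapply(
  amount,
  function(x) which.min(abs(x - sorted_centres)),
  integer(1)
)

new_label
# [1] 1 1 1 2 2 2 3 3 3 1 2 3
```

This recomputes every distance, which is redundant because kmeans has already assigned each sample to its nearest centre; remapping the existing labels is enough.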

Is there another way to do this automatically in R?

Answer 1:

One idea is to build a named vector by matching the sorted centers against the original centers, so that each value is an old cluster label and its name is the new label. Then match mydata$label against that vector and take the corresponding names, i.e.

i1 <- setNames(match(sort(result$centers), result$centers), rownames(result$centers))

as.numeric(names(i1)[match(mydata$label, i1)])
# [1] 1 1 1 2 2 2 3 3 3 1 2 3
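A shorter variant of the same remapping, offered as a sketch rather than as this answer's method: rank() gives each centre's position in sorted order, so indexing that vector by the old labels yields the new ones. The cluster vector and centres below are typed in from the question's printed output, and old_label, centres, new_of_old and new_label are illustrative names.

```r
# Assumption: values copied from the result printed in the question
old_label <- c(2, 2, 2, 3, 3, 3, 1, 1, 1, 2, 3, 1)
centres   <- c(43.25, 2.50, 21.25)   # centre of cluster 1, 2, 3

# Position of each centre in increasing order: old label -> new label
new_of_old <- as.integer(rank(centres))   # 3 1 2

# Look up the new label for every sample
new_label <- new_of_old[old_label]

new_label
# [1] 1 1 1 2 2 2 3 3 3 1 2 3
```

With a live kmeans object the same lookup would use result$centers[, 1] and result$cluster. Tied centres would get averaged ranks, so this sketch assumes all centres are distinct.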


Answer 2:

You can use a for loop, if you don't mind loops:

cls <- result$cluster
# iterate over the clusters, not the samples; i is the new label
for (i in seq_len(nrow(result$centers)))
     result$cluster[cls == order(result$centers)[i]] <- i

result$cluster
#[1] 1 1 1 2 2 2 3 3 3 1 2 3
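Note that the loop changes only result$cluster, so result$centers and the other per-cluster components stay in the old order. A minimal sketch of keeping them in step is given below, assuming one-dimensional data as in the question. The helper name relabel_kmeans is hypothetical, and set.seed is there only to make the run reproducible.

```r
# Hypothetical helper: reorder a kmeans result by increasing centre value
relabel_kmeans <- function(res) {
  ord <- order(res$centers[, 1])          # old labels in sorted-centre order
  res$cluster  <- match(res$cluster, ord) # old label -> its rank among centres
  res$centers  <- res$centers[ord, , drop = FALSE]
  rownames(res$centers) <- seq_along(ord)
  res$withinss <- res$withinss[ord]       # keep per-cluster stats aligned
  res$size     <- res$size[ord]
  res
}

set.seed(1)  # assumption: any seed works, fixed only for reproducibility
amount <- c(1, 2, 3, 20, 21, 21, 40, 41, 42, 4, 23, 50)
result <- kmeans(amount, 3, nstart = 10)

relabelled <- relabel_kmeans(result)
relabelled$cluster
relabelled$centers
```

After this call, cluster 1 is always the one with the smallest centre, whichever arbitrary numbering kmeans happened to return.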