reordering cluster numbers for correct corresponde

I have a dataset that I clustered using two different clustering algorithms. The results are about the same, but the cluster numbers are permuted. Now for displaying the color coded labels, I want the label ids to be same for the same clusters. How can I get correct permutation between the two label ids?

I can do this using brute force, but perhaps there is a better/faster method. I would greatly appreciate any help or pointers. If possible I am looking for a python function.

标签： python python-2.7 cluster-analysis permutation

2条回答

够拽才男人

2楼-- · 2020-02-14 08:45

The most well-known algorithm for finding the optimum matching is the hungarian method.

Because it cannot be explained in a few sentences, I have to refer you to a book of your choice, or Wikipedia article "Hungarian algorithm".

You can probably get good results (even perfect if the difference is indeed tiny) by simply picking the maximum of the correspondence matrix and then removing that row and column.

0人赞添加讨论(0) 举报

疯言疯语

3楼-- · 2020-02-14 08:54

I have a function that works for me. But it may fail when the two cluster results are very inconsistent, which leads to duplicated max values in the contingency matrix. If your cluster results are about the same, it should work.

Here is my code:

from sklearn.metrics.cluster import contingency_matrix

def align_cluster_index(ref_cluster, map_cluster):
"""
remap cluster index according the the ref_cluster.
both inputs must be nparray and have same number of unique cluster index values.

Xin Niu Jan-15-2020
"""

ref_values = np.unique(ref_cluster)
map_values = np.unique(map_cluster)

print(ref_values)
print(map_values)

num_values = ref_values.shape[0]

if ref_values.shape[0]!=map_values.shape[0]:
    print('error: both inputs must have same number of unique cluster index values.')
    return()

switched_col = set()
while True:
    cont_mat = contingency_matrix(ref_cluster, map_cluster)
    print(cont_mat)
    # divide contingency_matrix by its row and col sums to avoid potential duplicated values:
    col_sum = np.matmul(np.ones((num_values, 1)), np.sum(cont_mat, axis = 0).reshape(1, num_values))
    row_sum = np.matmul(np.sum(cont_mat, axis = 1).reshape(num_values, 1), np.ones((1, num_values)))
    print(col_sum)
    print(row_sum)

    cont_mat = cont_mat/(col_sum+row_sum)
    print(cont_mat)

    # ignore columns that have been switched:
    cont_mat[:, list(switched_col)]=-1

    print(cont_mat)

    sort_0 = np.argsort(cont_mat, axis = 0)
    sort_1 = np.argsort(cont_mat, axis = 1)

    print('argsort contmat:')
    print(sort_0)
    print(sort_1)

    if np.array_equal(sort_1[:,-1], np.array(range(num_values))):
        break

    # switch values according to the max value in the contingency matrix:
    # get the position of max value:
    idx_max = np.unravel_index(np.argmax(cont_mat, axis=None), cont_mat.shape)
    print(cont_mat)
    print(idx_max)

    if (cont_mat[idx_max]>0) and (idx_max[0] not in switched_col):
        cluster_tmp = map_cluster.copy()
        print('switch', map_values[idx_max[1]], 'and:', ref_values[idx_max[0]])
        map_cluster[cluster_tmp==map_values[idx_max[1]]]=ref_values[idx_max[0]]
        map_cluster[cluster_tmp==map_values[idx_max[0]]]=ref_values[idx_max[1]]

        switched_col.add(idx_max[0])
        print(switched_col)

    else:
        break

print('final argsort contmat:')
print(sort_0)
print(sort_1)

print('final cont_mat:')
cont_mat = contingency_matrix(ref_cluster, map_cluster)
col_sum = np.matmul(np.ones((num_values, 1)), np.sum(cont_mat, axis = 0).reshape(1, num_values))
row_sum = np.matmul(np.sum(cont_mat, axis = 1).reshape(num_values, 1), np.ones((1, num_values)))
cont_mat = cont_mat/(col_sum+row_sum)

print(cont_mat)

return(map_cluster)

And here is some test code:

ref_cluster = np.array([2,2,3,1,0,0,0,1,2,1,2,2,0,3,3,3,3])
map_cluster = np.array([0,0,0,1,1,3,2,3,2,2,0,0,0,2,0,3,3])

c = align_cluster_index(ref_cluster, map_cluster)
print(ref_cluster)
print(c)

>>>[2 2 3 1 0 0 0 1 2 1 2 2 0 3 3 3 3]
>>>[2 2 2 1 1 3 0 3 0 0 2 2 2 0 2 3 3]

0人赞添加讨论(0) 举报

reordering cluster numbers for correct corresponde

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间