Greedy clustering algorithm speed improvement

2020-05-01 08:41发布

问题:

I am trying to implement a very simple greedy clustering algorithm in python, but am hard-pressed to optimize it for speed. The algorithm will take a distance matrix, find the column with the most components less than a predetermined distance cutoff, and store the row indices (with components less than the cutoff) as the members of the cluster. The centroid of the cluster is the column index. The columns and rows of each member index are then removed from the distance matrix (resulting in a smaller --but still square-- matrix), and the algorithm iterates through successively smaller distance matrices until all clusters are found. Because each iteration depends on the last (a new distance matrix is formed so that there are no overlapping members between clusters), I think I can not avoid a slow for loop in python. I've tried numba (jit) to speed it up but I think it is reverting to python mode and so does not result in any speed gains. Here are two implementations of the algorithm. The first is slower than the latter. Any suggestions for speedups are most welcome. I am aware of other clustering algorithms as implemented in scipy or sklearn (such as DBSCAN, kmeans/medoids, etc), but am very keen to use the current one for my application. Thanks in advance for any suggestions.

Method 1 (slower):

def cluster(distance_matrix, cutoff=1):
    indices = np.arange(0, len(distance_matrix))
    boolean_distance_matrix = distance_matrix <= cutoff
    centroids = []
    members = []
    while boolean_distance_matrix.any():
        centroid = np.argmax(np.sum(boolean_distance_matrix, axis=0))
        mem_indices = boolean_distance_matrix[:, centroid]
        mems = indices[mem_indices]
        boolean_distance_matrix[mems, :] = False
        boolean_distance_matrix[:, mems] = False
        centroids.append(centroid)
        members.append(mems)
    return members, centroids

Method 2 (faster, but still slow for large matrices):

It takes as input an adjacency (sparse) matrix formed from sklearn's nearest neighbors implementation. This is the simplest and fastest way I could think to get the relevant distance matrix for clustering. I believe working with the sparse matrix also speeds up the clustering algorithm.

nbrs = NearestNeighbors(metric='euclidean', radius=1.5, 
algorithm='kd_tree')            
nbrs.fit(data)    
adjacency_matrix = nbrs.radius_neighbors_graph(data)   

def cluster(adjacency_matrix, gt=1):
    rows = adjacency_matrix.nonzero()[0]
    cols = adjacency_matrix.nonzero()[1]
    members = []
    member = np.ones(len(range(gt+1)))
    centroids = []
    appendc = centroids.append
    appendm = members.append
    while len(member) > gt:
        un, coun = np.unique(cols, return_counts=True)
        centroid = un[np.argmax(coun)]
        appendc(centroid)
        member = rows[cols == centroid]
        appendm(member)
        cols = cols[np.in1d(rows, member, invert=True)]
        rows = rows[np.in1d(rows, member, invert=True)]
    return members, centroids