This question already has an answer here:

Cluster one-dimensional data optimally? [closed] 1 answer
1D Number Array Clustering [duplicate] 2 answers

I have an algorithm that is running on a set of objects. This algorithm produces a score value that dictates the differences between the elements in the set.

The sorted output is something like this:

[1,1,5,6,1,5,10,22,23,23,50,51,51,52,100,112,130,500,512,600,12000,12230]

If you lay these values out in a spreadsheet, you can see that they form groups:

[1,1,5,6,1,5] [10,22,23,23] [50,51,51,52] [100,112,130] [500,512,600] [12000,12230]

Is there a way to programmatically get those groupings?

Maybe some clustering algorithm using a machine learning library? Or am I overthinking this?

I've looked at scikit but their examples are way too advanced for my problem...

Tags: python machine-learning cluster-analysis data-mining

3 Answers

何必那么认真

Reply #2 · 2019-01-16 15:50

A good option if you don't know the number of clusters is MeanShift:

import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

x = [1,1,5,6,1,5,10,22,23,23,50,51,51,52,100,112,130,500,512,600,12000,12230]

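# MeanShift expects 2-D input, so pair each value with a dummy zero coordinate.
# list(...) is needed on Python 3, and the np.int alias no longer exists in recent NumPy.
X = np.array(list(zip(x, np.zeros(len(x)))), dtype=int)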
bandwidth = estimate_bandwidth(X, quantile=0.1)
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
ms.fit(X)
labels = ms.labels_
cluster_centers = ms.cluster_centers_

labels_unique = np.unique(labels)
n_clusters_ = len(labels_unique)

for k in range(n_clusters_):
    my_members = labels == k
    print("cluster {0}: {1}".format(k, X[my_members, 0]))

Output for this algorithm:

cluster 0: [ 1  1  5  6  1  5 10 22 23 23 50 51 51 52]
cluster 1: [100 112 130]
cluster 2: [500 512]
cluster 3: [12000]
cluster 4: [12230]
cluster 5: [600]

By changing the quantile argument passed to estimate_bandwidth you change the bandwidth, and with it how many clusters are found.


劳资没心，怎么记你

Reply #3 · 2019-01-16 15:52

Don't use clustering for 1-dimensional data

Clustering algorithms are designed for multivariate data. When you have 1-dimensional data, sort it, and look for the largest gaps. This is trivial and fast in 1d, and not possible in 2d. If you want something more advanced, use Kernel Density Estimation (KDE) and look for local minima to split the data set.
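
The "sort it, and look for the largest gaps" idea can be sketched in plain Python as below. This is only an illustration, not a reference implementation: the function name split_by_largest_gaps and the n_groups parameter are made up for this example, and the number of groups has to be chosen by the caller.

```python
def split_by_largest_gaps(values, n_groups):
    """Sort the values and cut at the (n_groups - 1) widest gaps between neighbours.

    Hypothetical helper for illustration; the name and signature are assumptions.
    """
    ordered = sorted(values)
    if not 1 <= n_groups <= len(ordered):
        raise ValueError("n_groups must be between 1 and len(values)")

    # Gap i is the distance between ordered[i] and ordered[i + 1].
    gaps = [(ordered[i + 1] - ordered[i], i) for i in range(len(ordered) - 1)]

    # Keep the positions of the widest gaps, then restore left-to-right order.
    cut_after = sorted(i for _, i in sorted(gaps, reverse=True)[:n_groups - 1])

    groups = []
    start = 0
    for i in cut_after:
        groups.append(ordered[start:i + 1])
        start = i + 1
    groups.append(ordered[start:])
    return groups


data = [1, 1, 5, 6, 1, 5, 10, 22, 23, 23, 50, 51, 51, 52,
        100, 112, 130, 500, 512, 600, 12000, 12230]

for group in split_by_largest_gaps(data, 3):
    print(group)
# [1, 1, 1, 5, 5, 6, 10, 22, 23, 23, 50, 51, 51, 52, 100, 112, 130]
# [500, 512, 600]
# [12000, 12230]
```

Note that this compares absolute gaps, so on data like this, which spans several orders of magnitude, the jumps between the largest values dominate and the small values stay together in one group.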

There are a number of duplicates of this question:


别忘想泡老子

Reply #4 · 2019-01-16 15:53

You can use clustering to group these. The trick is to understand that there are two dimensions to your data: the dimension you can see, and the "spatial" dimension, which is each value's position in the list and looks like [0, 1, 2... 21]. You can create this matrix in numpy like so:

import numpy as np

y = [1,1,5,6,1,5,10,22,23,23,50,51,51,52,100,112,130,500,512,600,12000,12230]
x = range(len(y))
m = np.matrix([x, y]).transpose()

Then you can perform clustering on the matrix, with:

from scipy.cluster.vq import kmeans
kclust = kmeans(m, 5)

kclust's output will look like this:

(array([[   11,    51],
       [   15,   114],
       [   20, 12115],
       [    4,     9],
       [   18,   537]]), 21.545126372346271)

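For you, the most interesting part is the first column of the matrix, which gives the cluster centers along the x dimension. kmeans starts from randomly chosen centroids, so the order of the rows can differ from run to run, as it does between the matrix above and the output below: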

kclust[0][:, 0]
# [20 18 15  4 11]

You can then assign your points to a cluster based on which of the five centers they are closest to:

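# x-coordinates of the centers (first column of the codebook returned by kmeans)
cluster_indices = kclust[0][:, 0]
assigned_clusters = [abs(cluster_indices - e).argmin() for e in x]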
# [3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 2, 2, 2, 2, 1, 1, 0, 0, 0]
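
To see which values end up in which cluster, the labels can be used to group y. The sketch below is plain Python and hard-codes the centers from the run shown above ([20, 18, 15, 4, 11]) so that it is reproducible; a fresh kmeans run may return different centers. The nearest_center helper is a made-up name that mirrors the abs(...).argmin() step.

```python
from collections import defaultdict

y = [1, 1, 5, 6, 1, 5, 10, 22, 23, 23, 50, 51, 51, 52,
     100, 112, 130, 500, 512, 600, 12000, 12230]
x = range(len(y))

# Centers along the index dimension, copied from the run shown above.
# Assumption: hard-coded here only so that the example is deterministic.
cluster_indices = [20, 18, 15, 4, 11]


def nearest_center(position, centers):
    """Index of the center closest to position (first one wins on ties)."""
    return min(range(len(centers)), key=lambda i: abs(centers[i] - position))


assigned_clusters = [nearest_center(e, cluster_indices) for e in x]

# Collect the original values under their cluster label.
groups = defaultdict(list)
for label, value in zip(assigned_clusters, y):
    groups[label].append(value)

for label in sorted(groups):
    print(label, groups[label])
# 0 [600, 12000, 12230]
# 1 [500, 512]
# 2 [52, 100, 112, 130]
# 3 [1, 1, 5, 6, 1, 5, 10, 22]
# 4 [23, 23, 50, 51, 51]
```

Because the assignment uses only the position in the list, a cluster boundary can fall between two neighbouring values such as 51 and 52.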
