Meanshift in scikit learn (python) doesn't und

Posted 2019-08-03 00:09

Question:

I have a dataset with 7265 samples and 132 features. I want to use the MeanShift algorithm from scikit-learn, but I ran into this error:

Traceback (most recent call last):
  File "C:\Users\OJ\Dropbox\Dt\Code\visual\facetest\facetracker_video.py", line 130, in <module>
    labels, centers = getClusters(data,clusters)
  File "C:\Users\OJ\Dropbox\Dt\Code\visual\facetest\facetracker_video.py", line 34, in getClusters
    ms.fit(np.array(dataarray))
  File "C:\python2.7\lib\site-packages\sklearn\cluster\mean_shift_.py", line 280, in fit
    cluster_all=self.cluster_all)
  File "C:\python2.7\lib\site-packages\sklearn\cluster\mean_shift_.py", line 137, in mean_shift
    nbrs = NearestNeighbors(radius=bandwidth).fit(sorted_centers)
  File "C:\python2.7\lib\site-packages\sklearn\neighbors\base.py", line 642, in fit
    return self._fit(X)
  File "C:\python2.7\lib\site-packages\sklearn\neighbors\base.py", line 180, in _fit
    raise ValueError("data type not understood")
ValueError: data type not understood

My code:

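import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

# data holds the 7265 samples with 132 features each
dataarray = np.array(data)
bandwidth = estimate_bandwidth(dataarray, quantile=0.2, n_samples=len(dataarray))
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
ms.fit(dataarray)
labels = ms.labels_
cluster_centers = ms.cluster_centers_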

If I check the datatype of the data variable I see:

print isinstance( dataarray, np.ndarray )
>>> True

The bandwidth is 0.925538333061 and dataarray.dtype is float64.

I'm using scikit-learn 0.14.1.

I can cluster the same data with other scikit-learn algorithms (I tried KMeans and DBSCAN). What am I doing wrong?


EDIT:

The data can be found here (pickle format): http://ojtwist.be/datatocluster.p and here (NumPy .npz format): http://ojtwist.be/datatocluster.npz

Answer 1:

That's a bug in scikit-learn. It is documented here.

There is a float -> int cast during the fitting process that can crash in some cases (it places the seed points at the corners of the bins instead of at their centers). The link includes some code that fixes the problem.

If you don't want to patch the scikit-learn code yourself (and want your code to stay compatible with other machines), I suggest you normalize your data before passing it to MeanShift.
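To make the corner-versus-center point concrete, here is a small NumPy sketch of the difference between a truncating cast and rounding when points are snapped to a grid. It is only an illustration under assumed values (bin_size and points are made up) and is not the library's actual binning code.

```python
import numpy as np

# Hypothetical values chosen only for illustration.
bin_size = 1.0
points = np.array([[0.9, -0.9],
                   [1.4, 2.6]])

# A float -> int cast truncates toward zero, so every point is
# snapped to a corner of its grid cell.
corner_seeds = (points / bin_size).astype(np.int32) * bin_size

# Rounding snaps every point to the nearest grid point instead.
center_seeds = np.round(points / bin_size) * bin_size

print(corner_seeds)
print(center_seeds)
```

With the truncating cast, 0.9 and -0.9 both collapse to 0 even though they sit almost a full bin away from it, whereas rounding keeps them at 1 and -1.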

Try this:

>>> from sklearn import preprocessing
>>> data2 = preprocessing.scale(dataarray)

Then use data2 in your code. It worked for me.

If you don't want to use either workaround, this is a great opportunity to contribute to the project by making a pull request with the fix :)

Edit: You will probably want to keep the scaling parameters so that you can map the MeanShift results (for example, the cluster centers) back to the original units. To do that, use a StandardScaler object instead of the scale function.
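A minimal sketch of the StandardScaler approach could look like the following. The synthetic dataarray is only a stand-in for the real data (an assumption, as are the blob parameters), and the code targets a current scikit-learn release rather than 0.14.1.

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumption): two blobs whose
# features live on very different scales.
rng = np.random.RandomState(0)
dataarray = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=[1.0, 100.0], size=(100, 2)),
    rng.normal(loc=[10.0, 1000.0], scale=[1.0, 100.0], size=(100, 2)),
])

# Fit the scaler once and keep it, so the scaling can be undone later.
scaler = StandardScaler()
data2 = scaler.fit_transform(dataarray)

# Cluster in the scaled space.
bandwidth = estimate_bandwidth(data2, quantile=0.2)
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
ms.fit(data2)
labels = ms.labels_

# Map the cluster centers back to the original feature units.
centers_original = scaler.inverse_transform(ms.cluster_centers_)
print(centers_original)
```

The labels need no conversion, since they only index the clusters; only the centers (and any other coordinates) have to go through inverse_transform.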

Good luck!