I have a dataset with 7265 samples and 132 features. I want to use the MeanShift algorithm from scikit-learn, but I ran into this error:
Traceback (most recent call last):
File "C:\Users\OJ\Dropbox\Dt\Code\visual\facetest\facetracker_video.py", line 130, in <module>
labels, centers = getClusters(data,clusters)
File "C:\Users\OJ\Dropbox\Dt\Code\visual\facetest\facetracker_video.py", line 34, in getClusters
ms.fit(np.array(dataarray))
File "C:\python2.7\lib\site-packages\sklearn\cluster\mean_shift_.py", line 280, in fit
cluster_all=self.cluster_all)
File "C:\python2.7\lib\site-packages\sklearn\cluster\mean_shift_.py", line 137, in mean_shift
nbrs = NearestNeighbors(radius=bandwidth).fit(sorted_centers)
File "C:\python2.7\lib\site-packages\sklearn\neighbors\base.py", line 642, in fit
return self._fit(X)
File "C:\python2.7\lib\site-packages\sklearn\neighbors\base.py", line 180, in _fit
raise ValueError("data type not understood")
ValueError: data type not understood
My code:
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

dataarray = np.array(data)
bandwidth = estimate_bandwidth(dataarray, quantile=0.2, n_samples=len(dataarray))
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
ms.fit(dataarray)
labels = ms.labels_
cluster_centers = ms.cluster_centers_
If I check the datatype of the data variable I see:
print isinstance( dataarray, np.ndarray )
>>> True
The bandwidth is 0.925538333061 and dataarray.dtype is float64.
I'm using scikit-learn 0.14.1.
I can cluster with other algorithms in scikit-learn (I tried k-means and DBSCAN). What am I doing wrong?
EDIT:
The data can be found here (pickle format): http://ojtwist.be/datatocluster.p and here: http://ojtwist.be/datatocluster.npz
That's a bug in the scikit-learn project. It is documented here.
There is a float -> int cast during the fitting process that can crash in some cases (it places the seed points at the corners of the bins instead of at their centers). The link includes some code that fixes the problem.
If you don't want to modify the scikit-learn code yourself (and want your code to stay compatible with other machines), I suggest you normalize your data before passing it to MeanShift.
Try this:
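A minimal sketch of the normalization step, assuming `sklearn.preprocessing.scale` is used to standardize each feature. The random `dataarray` below is only a hypothetical stand-in for the real data, and whether scaling avoids the error may depend on the dataset and the scikit-learn version:

```python
import numpy as np
from sklearn.preprocessing import scale
from sklearn.cluster import MeanShift, estimate_bandwidth

# Hypothetical stand-in for the real data: two blobs in 5 dimensions
rng = np.random.RandomState(0)
dataarray = np.vstack([rng.normal(0, 1, (100, 5)),
                       rng.normal(8, 1, (100, 5))])

# Standardize every feature to zero mean and unit variance
data2 = scale(dataarray)

# Same clustering steps as in the question, applied to the scaled data
bandwidth = estimate_bandwidth(data2, quantile=0.2, n_samples=len(data2))
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
ms.fit(data2)
labels = ms.labels_
cluster_centers = ms.cluster_centers_
```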
Then use data2 in your code. It worked for me.
If you don't want to use either solution, this is a great opportunity to contribute to the project by making a pull request with the fix :)
Edit: You probably want to retain the information needed to "descale" the results of MeanShift. So use a StandardScaler object instead of a scaling function.
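A sketch of how that could look, assuming the cluster centers are what needs to be mapped back to the original units. The random `dataarray` is again a hypothetical stand-in for the real data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import MeanShift, estimate_bandwidth

# Hypothetical stand-in for the real data: two blobs in 5 dimensions
rng = np.random.RandomState(0)
dataarray = np.vstack([rng.normal(0, 1, (100, 5)),
                       rng.normal(8, 1, (100, 5))])

# The scaler object remembers the per-feature mean and standard deviation
scaler = StandardScaler()
data2 = scaler.fit_transform(dataarray)

bandwidth = estimate_bandwidth(data2, quantile=0.2, n_samples=len(data2))
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
ms.fit(data2)

# Map the cluster centers from scaled space back to the original units
centers_original = scaler.inverse_transform(ms.cluster_centers_)
```

The labels need no descaling, since they are only cluster indices; only coordinates such as the centers have to go through `inverse_transform`.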
Good luck!