Python Clustering Algorithms

2020-06-09 05:37发布

问题:

I've been looking around scipy and sklearn for clustering algorithms for a particular problem I have. I need some way of characterizing a population of N particles into k groups, where k is not necessarily know, and in addition to this, no a priori linking lengths are known (similar to this question).

I've tried kmeans, which works well if you know how many clusters you want. I've tried dbscan, which does poorly unless you tell it a characteristic length scale on which to stop looking (or start looking) for clusters. The problem is, I have potentially thousands of these clusters of particles, and I cannot spend the time to tell kmeans/dbscan algorithms what they should go off of.

Here is an example of what dbscan find:

You can see that there really are two separate populations here, though adjusting the epsilon factor (the max. distance between neighboring clusters parameter), I simply cannot get it to see those two populations of particles.

Is there any other algorithms which would work here? I'm looking for minimal information upfront - in other words, I'd like the algorithm to be able to make "smart" decisions about what could constitute a separate cluster.

回答1:

I've found one that requires NO a priori information/guesses and does very well for what I'm asking it to do. It's called Mean Shift and is located in SciKit-Learn. It's also relatively quick (compared to other algorithms like Affinity Propagation).

Here's an example of what it gives:

I also want to point out that in the documentation is states that it may not scale well.



回答2:

  • When using DBSCAN it can be helpful to scale/normalize data or distances beforehand, so that estimation of epsilon will be relative.

  • There is a implementation of DBSCAN - I think its the one Anony-Mousse somewhere denoted as 'floating around' - , which comes with a epsilon estimator function. It works, as long as its not fed with large datasets.

  • There are several incomplete versions of OPTICS at github. Maybe you can find one to adapt it for your purpose. Still trying to figure out myself, which effect minPts has, using one and the same extraction method.



回答3:

You can try a minimum spanning tree (zahn algorithm) and then remove the longest edge similar to alpha shapes. I used it with a delaunay triangulation and a concave hull:http://www.phpdevpad.de/geofence. You can also try a hierarchical cluster for example clusterfck.



回答4:

Your plot indicates that you chose the minPts parameter way too small.

Have a look at OPTICS, which does no longer need the epsilon parameter of DBSCAN.