Efficiently grouping similar numbers together [dup

2019-03-04 14:31发布

问题:

Possible Duplicate:
1D Number Array Clustering

I have an array of numbers like [1, 20, 300, 45, 5, 60, 10, 270, 3]. What is an efficient algorithm for grouping these numbers together based on proximity? In this case I'd expect something like [1, 3, 5], [20, 45, 60] and [270, 300].

回答1:

The hardest part of what you are asking is how to actually define proximity. What would you expect the output to be from [5,10,15,20]? Would it be the same groupings as for [500,1000,1500,2000]?

What about [1,2,3,5,7,8,9]? Should there be one group or three? (or two?).
What about [1,2,3,5,7,8,9,1075,4000]? Do 1075 and 4000 become grouped together? Does the groupings of the smaller numbers become changed by the larger numbers in the sample?

This question is something asked by an entire field of Machine Learning: Cluster Analysis Perhaps this related question will help?

I think what you want is K-means clustering (helpfully linked to in the related question), but you need to know how many groups you want to split your data into to use it.



回答2:

This might be massive overkill, but you might want to look into hierarchical clustering algorithms. These algorithms group together values into a hierarchy, from which you can easily extract the best k clusters. Agglomerative clustering is probably the easiest of these approaches to implement, and from experience it tends to produce very good clusters.

Hope this helps!