In R, is there an algorithm to create approximatel

Posted 2019-05-02 01:09

Question:

There seems to be a lot of information about creating either hierarchical or k-means clusters, but I would like to know if there is a solution in R that would create K clusters of approximately equal size. There is some material on doing this in other languages, but I have not been able to find anything online that suggests how to achieve the result in R.

An example would be

set.seed(123)
df <- matrix(rnorm(100*5), nrow=100)
km <- kmeans(df, 10)
print(sapply(1:10, function(n) sum(km$cluster==n)))

which results in

[1] 14 12  4 13 16  6  8  7 13  7

I would ideally like to see

[1] 10 10 10 10 10 10 10 10 10 10 

Answer 1:

I would argue that you shouldn't, in the first place. Why? When there are naturally well-formed clusters in your data, e.g.,

plot(matrix(c(sample(1:10, 10), sample(30:40, 7), sample(80:90, 9)), ncol = 2, byrow = FALSE))

then these will be clustered together anyway (assuming k equals the natural number of clusters; see this comprehensive answer on how to choose a good k). If the natural clusters are similar in size, you will get clusters of roughly equal size; if they are not, forcing a uniform cluster size will surely degrade the quality of the clustering solution. If you do not have naturally well-formed clusters in your data, e.g.,

plot(matrix(sample(1:100, 100), ncol = 2))

then forcing a cluster size will either be redundant (if the data is completely random, the cluster sizes will be ~equal - but then there is not much point in clustering anyhow), or, if there are some nice clusters in there, e.g.,

plot(matrix(c(sample(1:15, 15), sample(20:100, 11)), ncol = 2, byrow = TRUE))

then the forced size will almost certainly break them.
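If equal sizes are nonetheless a hard requirement, one possible approach (not taken from this answer, just a minimal sketch) is to run ordinary k-means and then reassign points greedily to their nearest centroid, subject to a per-cluster capacity. The names d, cap, counts and cluster are illustrative, and the greedy pass is a heuristic that does not minimise any global criterion:

```r
set.seed(123)
df <- matrix(rnorm(100 * 5), nrow = 100)   # same toy data as in the question
k  <- 10
km <- kmeans(df, k)

# n x k matrix of squared Euclidean distances from each point to each centroid
d <- sapply(seq_len(k), function(j) rowSums(sweep(df, 2, km$centers[j, ])^2))

cap     <- ceiling(nrow(df) / k)   # maximum number of points per cluster
cluster <- integer(nrow(df))       # 0 means "not assigned yet"
counts  <- integer(k)

# Visit all (point, centroid) pairs from closest to farthest and
# assign a point whenever it is still free and the cluster has room.
for (idx in order(d)) {
  i <- (idx - 1) %% nrow(df) + 1    # row index (point)
  j <- (idx - 1) %/% nrow(df) + 1   # column index (centroid)
  if (cluster[i] == 0L && counts[j] < cap) {
    cluster[i] <- j
    counts[j]  <- counts[j] + 1L
  }
}

print(table(cluster))
```

Because the total capacity equals the number of points here, every cluster ends up with exactly 10 members. Points near a full cluster get pushed to a farther centroid, which is exactly the loss of fit described above.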

Ward's method, mentioned in the comments by JasonAizkalns, will, however, give you more "round" clusters than, for example, single linkage, so that might be a way to go (cf. help(hclust) for the difference between "ward.D" and "ward.D2"; it's not arbitrary).
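As a rough illustration of that comparison, the sketch below clusters the question's toy data with Ward's method and with single linkage, then tabulates the sizes after cutting each tree into 10 groups. The object names are illustrative, and the sizes you get depend on the data, so treat this as a way to inspect the effect rather than a guarantee of balanced clusters:

```r
set.seed(123)
df <- matrix(rnorm(100 * 5), nrow = 100)   # same toy data as in the question

# Agglomerative clustering on Euclidean distances with two linkage rules
hc_ward   <- hclust(dist(df), method = "ward.D2")
hc_single <- hclust(dist(df), method = "single")

# Cut each dendrogram into 10 groups and count the members of each group
sizes_ward   <- table(cutree(hc_ward, k = 10))
sizes_single <- table(cutree(hc_single, k = 10))

print(sizes_ward)
print(sizes_single)
```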



Answer 2:

It's not totally clear what you're asking, but it is very easy to generate random data in R. If your data set has two dimensions, you could do something like this:

cluster1 <- data.frame(x = rnorm(100, mean = 5, sd = 1), y = rnorm(100, mean = 5, sd = 1))
cluster2 <- data.frame(x = rnorm(100, mean = 15, sd = 1), y = rnorm(100, mean = 15, sd = 1))

This generates normally distributed random data across x and y for 100 data points in each cluster.

Then view it -

plot(cluster1, xlim = c(0,25), ylim = c(0,25))
lines(cluster2, type = "p")
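To connect this back to cluster sizes, a possible next step (an assumption about what is wanted, not something stated in the question) is to stack the two groups and run kmeans on the result. The seed, the nstart value and the names pts, km2 and sizes are illustrative:

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible
cluster1 <- data.frame(x = rnorm(100, mean = 5,  sd = 1), y = rnorm(100, mean = 5,  sd = 1))
cluster2 <- data.frame(x = rnorm(100, mean = 15, sd = 1), y = rnorm(100, mean = 15, sd = 1))

# Stack both groups into one data set and cluster it with k = 2
pts <- rbind(cluster1, cluster2)
km2 <- kmeans(pts, centers = 2, nstart = 10)

# Count how many points fall in each fitted cluster
sizes <- table(km2$cluster)
print(sizes)
```

With groups this well separated, plain k-means should recover them as generated, so the sizes come out equal only because the groups were generated with equal sizes.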