There is a lot of information about creating hierarchical or k-means clusters, but I would like to know whether there is a solution in R that creates K clusters of approximately equal size. There is some material on doing this in other languages, but my searches have not turned up anything that shows how to achieve it in R.
An example would be
set.seed(123)
df <- matrix(rnorm(100*5), nrow=100)
km <- kmeans(df, 10)
print(sapply(1:10, function(n) sum(km$cluster==n)))
which results in
[1] 14 12 4 13 16 6 8 7 13 7
I would ideally like to see
[1] 10 10 10 10 10 10 10 10 10 10
I would argue that you shouldn't, in the first place. Why? When there are naturally well-formed clusters in your data, e.g.,
plot(matrix(c(sample(1:10,10),sample(30:40, 7), sample(80:90,9)), ncol=2, byrow = F))
then these will be clustered together anyway (assuming k equals the natural number of clusters; see this comprehensive answer on how to choose a good k). If the natural clusters are uniform in size, you will get clusters of roughly equal size; if they are not, forcing a uniform cluster size will degrade the quality of the clustering solution.
If you do not have naturally pretty clusters in your data, e.g.,
plot(matrix(sample(1:100, 100), ncol=2))
then forcing a cluster size will either be redundant (if the data is completely random, the cluster sizes will be ~equal - but then there is not much point in clustering anyhow), or, if there are some nice clusters in there, e.g.,
plot(matrix(c(sample(1:15,15),sample(20:100, 11)), ncol=2, byrow = T))
then the forced size will almost certainly break them.
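If equal sizes are required regardless, one simple heuristic is a greedy, capacity-constrained assignment to the k-means centers. The sketch below is an illustration only, not an established package function: `equal_size_clusters` is a hypothetical helper name, and it assumes the number of rows is divisible by k. It visits point-center pairs from closest to farthest and assigns a point to a center only while that center still has room.

```r
# Hypothetical helper: greedy capacity-constrained assignment to k-means centers.
# Assumption: nrow(x) is divisible by k, so every cluster gets exactly n / k points.
equal_size_clusters <- function(x, k) {
  n <- nrow(x)
  stopifnot(n %% k == 0)
  cap <- n / k
  centers <- kmeans(x, k)$centers
  # n x k matrix of squared Euclidean distances from each point to each center
  d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
  assignment <- integer(n)  # 0 means "not yet assigned"
  counts <- integer(k)      # current size of each cluster
  # Visit point-center pairs from closest to farthest (column-major indices)
  for (idx in order(d)) {
    i <- (idx - 1) %% n + 1    # row index: the point
    j <- (idx - 1) %/% n + 1   # column index: the center
    if (assignment[i] == 0 && counts[j] < cap) {
      assignment[i] <- j
      counts[j] <- counts[j] + 1
    }
  }
  assignment
}

set.seed(123)
df <- matrix(rnorm(100 * 5), nrow = 100)
cl <- equal_size_clusters(df, 10)
print(table(cl))
```

The sizes are equal by construction, but points near a full center get pushed to a farther one, which is exactly the loss of fit described above.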
Ward's method, mentioned in the comments by JasonAizkalns, will, however, give you more "round" clusters than, for example, single-link, so that might be a way to go (cf. help(hclust) for the difference between "ward.D" and "ward.D2"; the distinction is not arbitrary).
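As a minimal sketch of that suggestion, the following applies Ward's method to the simulated data from the question and tabulates the group sizes. The choice of `"ward.D2"` with Euclidean distances and a cut at 10 groups are assumptions made here for illustration; nothing in this approach guarantees equal sizes.

```r
# Same simulated data as in the question
set.seed(123)
df <- matrix(rnorm(100 * 5), nrow = 100)

# Hierarchical clustering with Ward's criterion on Euclidean distances.
# Per help(hclust), "ward.D2" squares the dissimilarities before cluster
# updating, while "ward.D" does not.
hc <- hclust(dist(df), method = "ward.D2")

# Cut the tree into 10 groups and tabulate the group sizes
groups <- cutree(hc, k = 10)
sizes <- table(groups)
print(sizes)
```

Comparing `sizes` with the k-means counts from the question shows how much the balance differs between the two methods on the same data.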
It's not totally clear what you're asking, but it is very easy to generate random data in R. If your data set has two dimensions, you could do something like this:
cluster1 = data.frame(x = rnorm(100, mean=5,sd=1), y = rnorm(100, mean=5,sd=1))
cluster2 = data.frame(x = rnorm(100, mean=15,sd=1), y = rnorm(100, mean=15,sd=1))
This generates normally distributed random data across x and y for 100 data points in each cluster.
Then view it:
plot(cluster1, xlim = c(0,25), ylim = c(0,25))
points(cluster2)
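To connect this back to cluster sizes, here is a small sketch that stacks the two simulated groups and runs k-means on them. The seed, the use of `rbind`, and `nstart = 10` are assumptions added for reproducibility and stability, not part of the snippet above.

```r
set.seed(123)  # assumed seed, only so the run is reproducible

# Two well-separated groups of 100 points each, as above
cluster1 <- data.frame(x = rnorm(100, mean = 5, sd = 1),
                       y = rnorm(100, mean = 5, sd = 1))
cluster2 <- data.frame(x = rnorm(100, mean = 15, sd = 1),
                       y = rnorm(100, mean = 15, sd = 1))

# Stack both groups into one data set and cluster it with k = 2
both <- rbind(cluster1, cluster2)
km <- kmeans(both, centers = 2, nstart = 10)

# Tabulate how many points landed in each cluster
sizes <- table(km$cluster)
print(sizes)
```

When the groups are this well separated and equally populated, the sizes come out balanced without any size constraint.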