I am looking for an algorithm that solves the following problem:
- Given: a set of items and their similarity matrix.
- Goal: group these items in "clusters" of minimum size m
- Conditions:
- There are no cluster-like structures in the dataset, as shown in
Figure 1
- Anyway, the items in a group should be similar to each other. Thus, the global similarity would be high.
The motivation is not to identify good clusters but to split a dataset into groups of high similarity and of minimum size. Partitioning around medoids does not work out-of-the box, it would produce a lot of 1-item-clusters. Hierarchical approaches (AGNES, DIANA) does not help either.
This problem is someway similar to Stable Marriage problems: one tries to rank the neighbored items by similarity. But here, there are at least m items in one group / one marriage.
Thanks in advance!
This is not clustering then. A clustering algorithm should tell your that there are no clusters there. Your task sounds more like bin packing, knapsack and similar optimization problems to me.
Without any further constraints, your problem is also underspecified.
Why don't you try a greedy heuristic (which is common to use for knapsack like problems). Choose any point at random, add enough points to satisfy your minimum size constraint.
Then choose the furthest point from this, and add again enough points to satisfy your minimum size. Repeat (using sum of distances for ranking), until you no longer can satisfy the minimum size. Then add every remaining point to the nearest cluster each. Finally, do an optimization pass moving points to other clusters as long as the minimum size is satisfied?
While this is not a typical clustering task (see @Anony-Mousse), you can modify a clustering algorithm to suit your needs.
You can follow this Tutorial for Same-Size K-Means, or simply use this algorithm from the tutorial
package/module in ELKI (build the latest version from GitHub, because I just fixed a bug there).
Essentially, this algorithm performs a k-means style least-squares optimization, but all clusters must have the same size (if N/k is not integer, the cluster sizes may vary by 1).
If you go to above tutorial and scroll to the bottom, you can see example results.