This is a continuation of the question posted here: Finding the center of mass on a 2D bitmap, which asked how to find the center of mass in a boolean matrix like the one in its example.
Suppose now we expand the matrix to this form:
0 1 2 3 4 5 6 7 8 9
1 . X X . . . . . .
2 . X X X . . X . .
3 . . . . . X X X .
4 . . . . . . X . .
5 . X X . . . . . .
6 . X . . . . . . .
7 . X . . . . . . .
8 . . . . X X . . .
9 . . . . X X . . .
As you can see we now have 4 centers of mass, for 4 different clusters.
We already know how to find a center of mass when only one exists. If we run that algorithm on this matrix, we'll get some point in the middle of the matrix, which does not help us.
What can be a good, correct and fast algorithm to find these clusters of mass?
I think I would check each point in the matrix and figure out its mass based on its neighbours. The contribution of each neighbour would fall off with, say, the square of the distance. You could then pick the top four points that are at least some minimum distance from each other.
Here's some Python code I whipped together to try to illustrate the approach for finding out the mass for each point. Some setup using your example matrix:
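A minimal sketch of such a setup, assuming the grid is typed in as strings without the row and column labels (the names `ROWS` and `matrix` are assumptions, not taken from the original post):

```python
# The example grid from the question, one string per row, labels stripped.
ROWS = [
    ".XX......",
    ".XXX..X..",
    ".....XXX.",
    "......X..",
    ".XX......",
    ".X.......",
    ".X.......",
    "....XX...",
    "....XX...",
]

# matrix[y][x] is 1 where the bitmap is set and 0 elsewhere (zero-based).
matrix = [[1 if ch == "X" else 0 for ch in row] for row in ROWS]
```

Indexing is `matrix[y][x]`, so `x` is the column and `y` is the row.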
To calculate the mass for a given point:
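One possible shape for that function is sketched below. The window `radius` and the weight `1 / (1 + d) ** 2` are assumptions chosen for illustration; the answer only says the contribution should fall off with roughly the square of the distance.

```python
ROWS = [".XX......", ".XXX..X..", ".....XXX.", "......X..", ".XX......",
        ".X.......", ".X.......", "....XX...", "....XX..."]
matrix = [[1 if ch == "X" else 0 for ch in row] for row in ROWS]

def mass(x, y, radius=2):
    """Sum the weights of all set cells within `radius` of (x, y).

    Distance d is the Manhattan distance; each set cell adds 1 / (1 + d)**2,
    so the cell itself adds 1 and farther cells add less (assumed weighting).
    """
    total = 0.0
    for ny in range(max(0, y - radius), min(len(matrix), y + radius + 1)):
        for nx in range(max(0, x - radius), min(len(matrix[0]), x + radius + 1)):
            d = abs(nx - x) + abs(ny - y)
            if matrix[ny][nx] and d <= radius:
                total += 1.0 / (1 + d) ** 2
    return total
```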
Note: I'm using Manhattan distances (a.k.a. cityblock or taxicab geometry) here because I don't think the added accuracy of Euclidean distances is worth the cost of calling sqrt().
Iterating through our matrix and building up a list of tuples like (x, y, mass(x,y)):
Sorting the list on the mass for each point:
Looking at the top 9 points in that sorted list:
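The three steps above could be sketched together roughly as follows. The `mass()` here is a stand-in with an assumed radius and weighting, so the numbers it prints are not necessarily those of the original answer.

```python
ROWS = [".XX......", ".XXX..X..", ".....XXX.", "......X..", ".XX......",
        ".X.......", ".X.......", "....XX...", "....XX..."]
matrix = [[1 if ch == "X" else 0 for ch in row] for row in ROWS]

def mass(x, y, radius=2):
    # Assumed weighting: each set cell at Manhattan distance d adds 1/(1+d)^2.
    total = 0.0
    for ny in range(max(0, y - radius), min(len(matrix), y + radius + 1)):
        for nx in range(max(0, x - radius), min(len(matrix[0]), x + radius + 1)):
            d = abs(nx - x) + abs(ny - y)
            if matrix[ny][nx] and d <= radius:
                total += 1.0 / (1 + d) ** 2
    return total

# Build a list of (x, y, mass) tuples for every cell in the matrix.
points = [(x, y, mass(x, y))
          for y in range(len(matrix))
          for x in range(len(matrix[0]))]

# Sort by mass, heaviest first.
points.sort(key=lambda p: p[2], reverse=True)

# Show the nine heaviest points.
for x, y, m in points[:9]:
    print(x, y, round(m, 3))
```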
If we work from highest to lowest and filter away points that are too close to already seen points, we get (I'm doing it manually since I've run out of time to do it in code...):
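That manual filtering step could be automated with a greedy pass like the sketch below. The function name `pick_centers`, the minimum distance of 3 and the `mass()` weighting are all assumptions made for illustration.

```python
ROWS = [".XX......", ".XXX..X..", ".....XXX.", "......X..", ".XX......",
        ".X.......", ".X.......", "....XX...", "....XX..."]
matrix = [[1 if ch == "X" else 0 for ch in row] for row in ROWS]

def mass(x, y, radius=2):
    # Assumed weighting: each set cell at Manhattan distance d adds 1/(1+d)^2.
    total = 0.0
    for ny in range(max(0, y - radius), min(len(matrix), y + radius + 1)):
        for nx in range(max(0, x - radius), min(len(matrix[0]), x + radius + 1)):
            d = abs(nx - x) + abs(ny - y)
            if matrix[ny][nx] and d <= radius:
                total += 1.0 / (1 + d) ** 2
    return total

def pick_centers(points, count=4, min_dist=3):
    """Walk points from heaviest to lightest, keeping those far from earlier picks."""
    chosen = []
    for x, y, m in points:
        if all(abs(x - cx) + abs(y - cy) >= min_dist for cx, cy, _ in chosen):
            chosen.append((x, y, m))
            if len(chosen) == count:
                break
    return chosen

points = sorted(
    ((x, y, mass(x, y)) for y in range(len(matrix)) for x in range(len(matrix[0]))),
    key=lambda p: p[2],
    reverse=True,
)
centers = pick_centers(points)
```

With these assumed parameters, the pass should keep one set cell from each of the four clusters in the example matrix.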
Which is a pretty intuitive result from just looking at your matrix (note that the coordinates are zero based when comparing with your example).
Here's a similar question with a not so fast algorithm, and several other better ways to do it.
You need a clustering algorithm. This is easy here, since you just have a 2-dimensional grid and the entries of each cluster border each other, so you can simply use a flood-fill algorithm. Once you have each cluster, you can find its center as in the 2D center of mass article.
My first thought would be to first find any cell with a non-zero value. From there do some flood-fill algorithm, and compute the center of mass of the cells found. Next zero out the found cells from the matrix, and start over from the top.
This would of course not scale as well as the method from Google, that tuinstoel linked, but would be easier to implement for smaller matrices.
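The flood-fill approach described above might look something like this sketch. It assumes 4-connectivity (cells touching only diagonally belong to different clusters) and an unweighted center of mass; the function name `cluster_centers` is hypothetical.

```python
from collections import deque

ROWS = [".XX......", ".XXX..X..", ".....XXX.", "......X..", ".XX......",
        ".X.......", ".X.......", "....XX...", "....XX..."]
matrix = [[1 if ch == "X" else 0 for ch in row] for row in ROWS]

def cluster_centers(grid):
    """Return one (x, y) center of mass per 4-connected cluster of set cells."""
    grid = [row[:] for row in grid]  # work on a copy; visited cells get zeroed
    height, width = len(grid), len(grid[0])
    centers = []
    for y in range(height):
        for x in range(width):
            if not grid[y][x]:
                continue
            # Found an unvisited set cell: flood-fill its cluster.
            queue = deque([(x, y)])
            grid[y][x] = 0
            cells = []
            while queue:
                cx, cy = queue.popleft()
                cells.append((cx, cy))
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    if 0 <= nx < width and 0 <= ny < height and grid[ny][nx]:
                        grid[ny][nx] = 0
                        queue.append((nx, ny))
            # Center of mass is the mean position of the cluster's cells.
            centers.append((sum(c[0] for c in cells) / len(cells),
                            sum(c[1] for c in cells) / len(cells)))
    return centers

centers = cluster_centers(matrix)
```

Each cell is visited a constant number of times, so the whole pass is linear in the size of the matrix.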
EDIT:
Disjoint sets (using path compression and union-by-rank) could be useful here. They have O(α(n)) amortized time complexity for union and find-set, where α(n) is the inverse of the Ackermann function Ak(n), so α(n) will essentially be O(1) for any reasonable value of n. The only problem is that disjoint sets are a one-way mapping of item to set, but this won't matter if you are going through all items.
Here is a simple python script for demonstration:
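A minimal sketch of what such a script could look like is given below. It is not the original script: the `DisjointSet` class and this version of `find_clusters()` are assumptions, and adjacency is taken to be 4-connected.

```python
ROWS = [".XX......", ".XXX..X..", ".....XXX.", "......X..", ".XX......",
        ".X.......", ".X.......", "....XX...", "....XX..."]
matrix = [[1 if ch == "X" else 0 for ch in row] for row in ROWS]

class DisjointSet:
    """Union-find with path compression and union-by-rank."""

    def __init__(self):
        self.parent = {}
        self.rank = {}

    def add(self, item):
        self.parent.setdefault(item, item)
        self.rank.setdefault(item, 0)

    def find(self, item):
        root = item
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path straight at the root.
        while self.parent[item] != root:
            self.parent[item], item = root, self.parent[item]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        # Union-by-rank: attach the shallower tree under the deeper one.
        if self.rank[ra] < self.rank[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1

def find_clusters(grid):
    """Group set cells into clusters by joining each with its left and upper neighbour."""
    sets = DisjointSet()
    for y, row in enumerate(grid):
        for x, value in enumerate(row):
            if not value:
                continue
            sets.add((x, y))
            if x > 0 and row[x - 1]:
                sets.union((x, y), (x - 1, y))
            if y > 0 and grid[y - 1][x]:
                sets.union((x, y), (x, y - 1))
    clusters = {}
    for cell in sets.parent:
        clusters.setdefault(sets.find(cell), []).append(cell)
    return list(clusters.values())

clusters = find_clusters(matrix)
centers = [(sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
           for c in clusters]
```

Because the grid is scanned row by row, each set cell only needs to be joined with its left and upper neighbours; the right and lower ones are handled when the scan reaches them.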
Output:
The point of this was to demonstrate disjoint sets. The actual algorithm in find_clusters() could be upgraded to something more robust.
References