How can I cluster points which are connected in MA

2019-02-21 05:22发布

问题:

Imagine we have many points which some of them are connected together and we want to cluster them.

Please take a look at the following figure.

If we have the "connectivity matrix" of points, how we can cluster them in two group (groups of connected points)?

ConnectivityMatrix=
                    [1 2
                     1 3
                     2 4
                     2 3
                     2 1
                     3 1
                     3 2
                     3 4
                     4 3
                     4 2
                     5 8
                     5 7
                     5 6
                     6 5
                     6 7
                     7 6
                     7 5
                     7 8
                     8 7
                     8 5]

The final result should be nodes of 1,2,3,4 in a first group(cluster) and nodes of 5,6,7,8 in a second group (cluster).

回答1:

Here is some code to get you started. I basically implemented Depth First Search... a very crude implementation of it anyway.

Depth First Search is an algorithm that is used for the traversal of trees. Graphs are essentially a special case of trees where there are leaf nodes that connect back to the root. The basic algorithm for Depth First Search is as so:

  • Start at the root of the tree and add this to a stack
  • For each node that is connected to the root, add this onto the stack and place this in a list of visited nodes
  • While there is still a node on the stack...
    1. Pop off a node off the stack
    2. Check the visited nodes list. If this is a node we have already visited, skip.
    3. Else, visit any nodes that are connected to this node we popped off and add to the stack

If we have disconnected graphs like what you have above, we basically run Depth First Search multiple times. Each time will be for one cluster. After one Depth First Search result, we will discover nodes that belong to one cluster. We restart Depth First Search again with any node that we have not touched yet, which will be a node from another cluster that we have not visited. As you clearly have two clusters in your graph structure, we will have to run Depth First Search two times. This is commonly referred to as finding all connected components in an overall graph.

To find the Connected Components, here are the steps that I did:

  1. Create a connectivity matrix
  2. Initialize a Boolean list that tells us whether or not we have visited a node in your graph
  3. Initialize an empty cluster list
  4. Initialize an empty stack that contains which nodes we should visit.
  5. While there is at least one node we need to visit...
    • Find such a node
    • Initialize our stack to contain this node
    • While our stack is not empty
      1. Pop off a node from the stack
      2. If we have visited this node, continue
      3. Else mark as visited
      4. Retrieve all nodes connected to this node
      5. Remove those nodes that are not on the stack in (4)
      6. Add these nodes to the stack and the cluster list
  6. Once the stack is empty, we have a list of all of the nodes contained in a single cluster. Add this cluster to a final list.
  7. Repeat 1 - 6 until we visit all nodes

Without further ado, this is the code. Bear in mind this is not battle tested. If you have graph structures that will generate an error, that'll be on your own to fix :)

ConnectivityMatrix = [1 2
                     1 3
                     2 4
                     2 3
                     2 1
                     3 1
                     3 2
                     3 4
                     4 3
                     4 2
                     5 8
                     5 7
                     5 6
                     6 5
                     6 7
                     7 6
                     7 5
                     7 8
                     8 7
                     8 5];

%// Find all possible node IDs
nodeIDs = unique(ConnectivityMatrix(:));

%// Flag that tells us if there are any nodes we should visit
nodeIDList = false(1,numel(nodeIDs));

%// Stores our list of clusters
clusterList = {};

%// Keeps track of how many clusters we have
counter = 1;

%// Stack - initialize to empty
stackNodes = [];

%// While there is at least one node we need to visit
while (~all(nodeIDList))    
    % Find any node
    stackNodes = find(nodeIDList == false, 1);
    % Initialize our stack to contain this node
    nodesCluster = stackNodes;

    %// While our stack is not empty
    while (~isempty(stackNodes))
        % Grab the node off the stack and pop off
        node = stackNodes(end);
        stackNodes(end) = [];        

        %// If we have marked this node as visited, skip
        if (nodeIDList(node))
            continue;
        end

        %// Mark as visited 
        nodeIDList(node) = true;

        %// Retrieve all nodes connected to this node
        connectedNodes = ConnectivityMatrix(ConnectivityMatrix(:,1) == node, :);
        nodesToVisit = unique(connectedNodes(:,2).');

        %// Remove those already visited
        visitedNodes = ~nodeIDList(nodesToVisit);
        finalNodesToVisit = nodesToVisit(visitedNodes);

        %// Add to cluster
        nodesCluster = unique([nodesCluster finalNodesToVisit]);

        %// Add to stack
        stackNodes = unique([stackNodes finalNodesToVisit]);                
    end

    %// Add connected components to its own cluster
    clusterList{counter} = nodesCluster;
    counter = counter + 1;
end

Once we have run this code, we can display our clusters like so:

celldisp(clusterList)

clusterList{1} =

 1     2     3     4


clusterList{2} =

 5     6     7     8

As such, cluster #1 contains nodes 1,2,3,4 while cluster #2 contains nodes 5,6,7,8.

Bear in mind that this code will only work if you sequentially label your nodes as you did in your diagram. You can't skip any label numbers (i.e. you can't do 1,2,4,6,9, etc. This should be 1,2,3,4,5).

Good luck!



回答2:

You can use "off-the-shelf" Matlab commands for this problem. For example, you can use graphconncomp.



回答3:

The answer from rayryeng is pretty good. However, here are some details I would like to point out:

  • "numel(nodeIDs)" would bring in possible errors if your node label is not sequentially (i.e., you have 10 nodes and the maximum node label is 20). You could switch "numel(nodeIDs)" to "max(nodeIDs) or a larger value."
  • I also believe the following code would bring in some problems (i.e., some nodes are missing and become isolated nodes) when utilizing this function:

    connectedNodes = ConnectivityMatrix(ConnectivityMatrix(:,1) == node, :);
    nodesToVisit = unique(connectedNodes(:,2).');
    

    I modified the simple two lines with the following messy code:

    connectedNodes1 = ConnectivityMatrix (ConnectivityMatrix (:,1) == node, :);
    connectedNodes2 = ConnectivityMatrix (ConnectivityMatrix (:,2) == node, :);
    AC=connectedNodes1(:,2).';
    AD=connectedNodes2(:,1).';
    ACA=reshape(AC,[],1);
    ADA=reshape(AD,[],1);
    AE= [ACA; ADA] ;
    AEA=reshape(AE,[],1);
    AEA=AEA';
    nodesToVisit = unique(AEA);
    

    After modifying these two points, there are no problems with rayryeng's initial code.