I have an undirected graph G=(V,E) where each vertex represents a router in a large network. Each edge represents a network hop from one router to the other therefore, all edges have the same weight. I wish to partition this network of routers into 3 or k different sets clustered by Hop count.
Motivation:
The idea is to replicate some data in routers contained in each of these 3 sets. This is so that whenever a node( or client or whatever) in the network graph requests for a certain data item, I can search for it in the 3 sets and get a responsible node(one that has cached that particular data) from each set. Then I'd select the node which is at the minimum hop count away from the requesting node and continue with my algorithms and tests.
The cache distribution and request response are not in the scope of this question. I just need a way to partition the network into 3 sets so that I can perform the above operations on it.
Which clustering algorithm could be used in such a scenario. I have almost 9000 nodes in the graph and I wish to get 3 sets of ~3000 nodes each
In the graph case, a clustering method based on minimum spanning trees can be used.
The regular algorithm is the following:
- Find the minimum spanning tree of the graph.
- Remove the
k - 1
longest edges in the spanning tree, where k
is the desired number of clusters.
However, this works only if the edges differ in length (or weight). In the case of edges of equal length, every spanning tree is a minimum one so this would not work well. However, putting a little thinking into it, a different algorithm came to my mind which uses BFS.
The algorithm:
1. for i = 1..k do // for each cluster
2. choose the number of nodes N in cluster i
3. choose an arbitrary node n
4. run breadth-first search (BFS) from n until N
5. assign the first N nodes (incl. n) tapped by the BFS to the i-th cluster
6. remove these nodes (and the incident edges) from the graph
7. done
This algorithm (the results) hugely depends on how step 3, i.e. choosing the "root" node of a cluster, is implemented. For the sake of simplicity I choose an arbitrary node, but it could be more sophisticated. The best nodes are those that are the at the "end" of the graph. You could find a center of the graph (a node that has the lowest sum of lengths of paths to all other nodes) and then use the nodes that are the furthes from this center.
The real issue is that your edges are equal (if I understood your problem statement well) and you have no information about the nodes (i.e. their coordinates - then you could use e.g. k-means).