Topological Sort with Grouping

Posted 2019-02-09 17:42

Question:

Ok, so in topological sorting, depending on the input data, there are usually multiple correct orders in which the graph can be "processed" so that all dependencies come before the nodes that depend on them. However, I'm looking for a slightly different answer:

Suppose the following data: a -> b and c -> d (a must come before b and c must come before d).
With just these two constraints we have multiple candidate solutions (a b c d, a c d b, c a b d, etc.). However, I'm looking to create a method of "grouping" these nodes so that after a group has been processed, all of the entries in the next group have their dependencies taken care of. For the data above I'd be looking for a grouping like (a, c) (b, d). Within each group the processing order doesn't matter (a before c or c before a, b before d or d before b), just so long as group 1 (a, c) completes before any of group 2 (b, d) is processed.

The only additional catch would be that each node should be in the earliest group possible. Consider the following:
a -> b -> c
d -> e -> f
x -> y

A grouping scheme of (a, d) (b, e, x) (c, f, y) would technically be correct because x comes before y. However, a better solution would be (a, d, x) (b, e, y) (c, f), because having x in group 2 implies that x depends on some node in group 1.

Any ideas on how to go about doing this?


EDIT: I think I managed to slap together some solution code. Thanks to all those who helped!

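#include <iostream>
#include <vector>
using namespace std;

// Topological sort with grouping
// Accepts: adjacency matrix where graph[a][b] is 0 for no edge and non-0 for an edge a -> b
// Returns: 1d array where each index is that node's group_id
vector<int> top_sort(const vector< vector<int> >& graph)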
{
    int size = graph.size();
    vector<int> group_ids = vector<int>(size, 0);
    vector<int> node_queue;

    // Find the root nodes, add them to the queue.
    for (int i = 0; i < size; i++)
    {
        bool is_root = true;

        for (int j = 0; j < size; j++)
        {
            if (graph[j][i] != 0) { is_root = false; break; }
        }

        if (is_root) { node_queue.push_back(i); }
    }

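    // Detect the no-root error case (every node has an incoming edge).
    // Note: a cycle that is reachable from a root is NOT detected here and
    // would keep the loop below from terminating.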
    if (node_queue.size() == 0)
    {
        cerr << "ERROR: No root nodes found in graph." << endl;
        return vector<int>(size, -1);
    }


    // Depth first search, updating each node with its new depth.
    while (node_queue.size() > 0)
    {
        int cur_node = node_queue.back();
        node_queue.pop_back();

        // For each node connected to the current node...
        for (int i = 0; i < size; i++)
        {
            if (graph[cur_node][i] == 0) { continue; }

            // See if dependent node needs to be updated with a later group_id
            if (group_ids[cur_node] + 1 > group_ids[i])
            {
                group_ids[i] = group_ids[cur_node] + 1;
                node_queue.push_back(i);
            }
        }
    }

    return group_ids;
}

Answer 1:

Label all root nodes with a level value of 0. Label each child with its parent's level + 1. If a node is being revisited, i.e. it already has a level value assigned, check whether the previously assigned value is lower than the new one. If so, update it with the higher value and propagate the change to its descendants.

Now you have as many groups as there are unique level labels 0 ... K.
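
To turn the per-node labels into actual groups, the nodes can be bucketed by level. The following is only a minimal sketch: `groups_from_levels` is a hypothetical helper name (it is not part of the question's code), and it assumes the labels are non-negative integers such as the `group_ids` returned by `top_sort` in the question.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the original post): converts a per-node
// level array into a list of groups, where groups[k] holds every node
// whose level is k, in increasing node order.
// Precondition: every level is >= 0, so the caller should check for the
// all -1 error result of top_sort before calling this.
std::vector<std::vector<int>> groups_from_levels(const std::vector<int>& levels)
{
    std::vector<std::vector<int>> groups;
    for (std::size_t node = 0; node < levels.size(); ++node)
    {
        assert(levels[node] >= 0);
        const std::size_t level = static_cast<std::size_t>(levels[node]);

        // Grow the outer vector on demand so that groups[level] exists.
        if (level >= groups.size())
        {
            groups.resize(level + 1);
        }
        groups[level].push_back(static_cast<int>(node));
    }
    return groups;
}
```

With the nodes a, b, c, d, e, f, x, y numbered 0 to 7, the levels from the question's second example bucket into (a, d, x) (b, e, y) (c, f).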



Answer 2:

I recently implemented this algorithm. I started with the approach you have shown, but it didn't scale to graphs of 20+ million nodes. The solution I ended up with is based on the approach detailed here.

You can think of it as computing the height of each node; each group is then the set of nodes at a given height.

Consider the graph:

A -> X
B -> X
X -> Y
X -> Z

So the desired output is (A, B), (X), (Y, Z).

The basic approach is to find everything that has nothing using it, i.e. nothing pointing to it (A and B in this example). All of these are at height 0.

Now remove A and B from the graph and find anything that now has nothing using it (X in this example). So X is at height 1.

Remove X from the graph and find anything that now has nothing using it (Y and Z in this example). So Y and Z are at height 2.

You can optimize this: you don't need to store bidirectional edges or actually remove anything from the graph. You only need, for each node, the number of things pointing to it, plus the list of nodes you know are at the next height.

So for this example at the start:

  • 0 things use A
  • 0 things use B
  • 2 things use X (A and B)
  • 1 thing uses each of Y and Z (X)

When you visit a node, decrement the count of each node it points to. If a count reaches zero, you know that node is at the next height.
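
The counting approach described above can be sketched as follows. This is an illustrative sketch rather than the answerer's actual implementation: the function name `group_by_height`, the edge-list input format, and the integer node numbering are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Sketch of the counting approach. Nodes are numbered 0..n-1 and each edge
// (from, to) means "from" must be processed before "to".
// Returns the groups in processing order. If the graph contains a cycle,
// the nodes on it (and anything reachable only through it) never reach a
// count of zero and are left out of the result.
std::vector<std::vector<int>> group_by_height(
    int n, const std::vector<std::pair<int, int>>& edges)
{
    assert(n >= 0);
    std::vector<std::vector<int>> out_edges(n);  // one-directional edges only
    std::vector<int> pending(n, 0);              // things still pointing at each node

    for (const auto& e : edges)
    {
        assert(e.first >= 0 && e.first < n && e.second >= 0 && e.second < n);
        out_edges[e.first].push_back(e.second);
        ++pending[e.second];
    }

    // Height 0: everything that nothing points to.
    std::vector<int> current;
    for (int v = 0; v < n; ++v)
    {
        if (pending[v] == 0) { current.push_back(v); }
    }

    std::vector<std::vector<int>> groups;
    while (!current.empty())
    {
        std::vector<int> next;
        for (int u : current)
        {
            // "Visiting" u: decrement the count of everything it points to.
            for (int v : out_edges[u])
            {
                if (--pending[v] == 0) { next.push_back(v); }
            }
        }
        groups.push_back(std::move(current));
        current = std::move(next);
    }
    return groups;
}
```

For the example graph, numbering A, B, X, Y, Z as 0 to 4 and passing the edges (0,2), (1,2), (2,3), (2,4) yields the groups (0, 1), (2), (3, 4), i.e. (A, B), (X), (Y, Z). Each node and edge is touched once, so the work grows linearly with the size of the graph, which is what makes this variant attractive for very large inputs.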