Implement Kahn's topological sorting algorithm

2019-03-01 16:22发布

问题:

Kahn proposed an algorithm in 62 to topologically sort any DAG (directed acyclic graph), pseudo code copied from Wikipedia:

L ← Empty list that will contain the sorted elements 
S ← Set of all nodes with no incoming edges 
while S is non-empty do
    remove a node n from S
    add n to tail of L
    for each node m with an edge e from n to m do
        remove edge e from the graph  # This is a DESTRUCTIVE step!
        if m has no other incoming edges then
            insert m into S if graph has edges then
    return error (graph has at least one cycle) else 
    return L (a topologically sorted order)

I need to implement it using IPython3, with the following implementation of a DAG:

class Node(object):
    def __init__(self, name, parents):
        assert isinstance(name, str)
        assert all(isinstance(_, RandomVariable) for _ in parents)
        self.name, self.parents = name, parents

where name is the label for the node and parents stores all of its parent nodes. Then the DAG class is implemented as:

class DAG(object):
    def __init__(self, *nodes):
        assert all(isinstance(_, Node) for _ in nodes)
        self.nodes = nodes

(The DAG implementation is fixed and not to be improved.) Then I need to implement Kahn's algorithm as a function top_order which takes in a DAG instance and returns an ordering like (node_1, node_2, ..., node_n). The main trouble is, this algorithm is destructive because one of its steps is remove edge e from the graph (line 5) which will delete one member of m.parents. However, I have to leave the DAG instance intact.

One way I can think of so far is to create a deep copy of the DAG instance taken in (even a shallow copy can't do the job because the algorithm still destroys the original instance via references), and perform the destructive algorithm on this copy, and then get the correct ordering of node names of this copy (assume there is no naming conflict between nodes), and then use this ordering of names to infer the correct ordering of the nodes of the original instance, which roughly goes like:

def top_order(network):
    '''takes in a DAG, prints and returns a topological ordering.'''
    assert type(network) == DAG
    temp = copy.deepcopy(network) # to leave the original instance intact

    ordering_name = []
    roots = [node for node in temp.nodes if not node.parents]
    while roots:
        n_node = roots[0]
        del roots[0]
        ordering_name.append(n_node.name)
        for m_node in temp.nodes:
            if n_node in m_node.parents:
                temp_list = list(m_node.parents)
                temp_list.remove(n_node)
                m_node.parents = tuple(temp_list)
                if not m_node.parents:
                    roots.append(m_node)

    print(ordering_name) # print ordering by name

    # gets ordering of nodes of the original instance
    ordering = []
    for name in ordering_name:
        for node in network.nodes:
            if node.name == name:
                ordering.append(node)

    return tuple(ordering)

Two problems: first, when network is huge the deep copy will be resource consuming; second, I want an improvement to my nested for loops which gets the ordering of the original instance. (For the second I think something like the sorted method etc pops into my mind.)

Any suggestion?

回答1:

I'm going to suggest a less literal implementation of the algorithm: you don't need to manipulate the DAG at all, you just need to manipulate info about the DAG. The only "interesting" things the algorithm needs are a mapping from a node to its children (the opposite of what your DAG actually stores), and a count of the number of each node's parents.

These are easy to compute, and dicts can be used to associate this info with node names (assuming all names are distinct - if not, you can invent unique names with a bit more code).

Then this should work:

def topsort(dag):
    name2node = {node.name: node for node in dag.nodes}
    # map name to number of predecessors (parents)
    name2npreds = {}
    # map name to list of successors (children)
    name2succs = {name: [] for name in name2node}

    for node in dag.nodes:
        thisname = node.name
        name2npreds[thisname] = len(node.parents)
        for p in node.parents:
            name2succs[p.name].append(thisname)

    result = [n for n, npreds in name2npreds.items() if npreds == 0]
    for p in result:
        for c in name2succs[p]:
            npreds = name2npreds[c]
            assert npreds
            npreds -= 1
            name2npreds[c] = npreds
            if npreds == 0:
                result.append(c)

    if len(result) < len(name2node):
        raise ValueError("no topsort - cycle")
    return tuple(name2node[p] for p in result)

There's one subtle point here: the outer loop appends to result while it's iterating over result. That's intentional. The effect is that every element in result is processed exactly once by the outer loop, regardless of whether an element was in the initial result or added later.

Note that while the input DAG and Nodes are traversed, nothing in them is altered.