Algorithm for topological sorting if cycles exist

2020-05-18 03:47发布

问题:

Some programming languages (like haskell) allow cyclic dependencies between modules. Since the compiler needs to know all definitions of all modules imported while compiling one module, it usually has to do some extra work if some modules import each other mutually or any other kind of cycle occurs. In that case, the compiler may not be able to optimize code as much as in modules that have no import cycles, since imported functions may have not yet been analyzed. Usually only one module of a cycle has to be compiled that way, as a binary object has no dependecies. Let's call such a module loop-breaker

Especially if the import cycles are interleaved it is interesting to know, how to minimize the number of loop-breakers when compiling a big project composed of hundreds of modules.

Is there an algorithm that given a set of dependecies outputs a minimal number of modules that need to be compiled as loop-breakers to compile the program successfully?

Example

I try to clarify what I mean in this example.

Consider a project with the four modules A, B, C and D. This is a list of dependencies between these modules, an entry X y means y depends on x:

A C
A D
B A
C B
D B

The same relation visualized as an ASCII-diagram:

D ---> B
^    / ^
|   /  |
|  /   |
| L    |
A ---> C

There are two cycles in this dependency-graph: ADB and ACB. To break these cycles one could for instance compile modules C and D as loop-breakers. Obviously, this is not the best approach. Compiling A as a loop-breaker is completely sufficient to break both loops and you need to compile one less module as a loop-breaker.

回答1:

This is the NP-hard (and APX-hard) problem known as minimum feedback vertex set. An approximation algorithm due to Demetrescu and Finocchi (pdf, Combinatorial Algorithms for Feedback Problems in Directed Graphs (2003)") works well in practice when there are no long simple cycles, as I would expect for your application.



回答2:

Here is how to do it in Python:

from collections import defaultdict

def topological_sort(dependency_pairs):
    'Sort values subject to dependency constraints'
    num_heads = defaultdict(int)   # num arrows pointing in
    tails = defaultdict(list)      # list of arrows going out
    for h, t in dependency_pairs:
        num_heads[t] += 1
        tails[h].append(t)

    ordered = [h for h in tails if h not in num_heads]
    for h in ordered:
        for t in tails[h]:
            num_heads[t] -= 1
            if not num_heads[t]:
                ordered.append(t)
    cyclic = [n for n, heads in num_heads.iteritems() if heads]
    return ordered, cyclic

def is_toposorted(ordered, dependency_pairs):
    '''Return True if all dependencies have been honored.
       Raise KeyError for missing tasks.
    '''
    rank = {t: i for i, t in enumerate(ordered)}
    return all(rank[h] < rank[t] for h, t in dependency_pairs)

print topological_sort('aa'.split())
ordered, cyclic = topological_sort('ah bg cf ch di ed fb fg hd he ib'.split())
print ordered, cyclic
print is_toposorted(ordered, 'ah bg cf ch di ed fb fg hd he ib'.split())
print topological_sort('ah bg cf ch di ed fb fg hd he ib ba xx'.split())

The runtime is linearly proportional to the number of edges (dependency pairs).

The algorithm is organized around a lookup table called num_heads that keeps a count the number of predecessors (incoming arrows). In the ah bg cf ch di ed fb fg hd he ib example, the counts are:

node  number of incoming edges
----  ------------------------
 a       0
 b       2
 c       0
 d       2
 e       1
 f       1  
 g       2
 h       2
 i       1

The algorithm works by "visting" nodes with no predecessors. For example, nodes a and c have no incoming edges, so they are visited first.

Visiting means that the nodes are output and removed from the graph. When a node is visited, we loop over its successors and decrement their incoming count by one.

For example, in visiting node a, we go to its successor h to decrement its incoming count by one (so that h 2 becomes h 1.

Likewise, when visiting node c, we loop over its successors f and h, decrementing their counts by one (so that f 1 becomes f 0 and h 1 becomes h 0).

The nodes f and h no longer have incoming edges, so we repeat the process of outputting them and removing them from the graph until all the nodes have been visited. In the example, the visitation order (the topological sort is):

a c f h e d i b g

If num_heads ever arrives at a state when there are no nodes without incoming edges, then it means there is a cycle that cannot be topologically sorted and the algorithm exits.