How do I generate all possible Newick Tree permuta

2019-04-07 23:13发布

How do I generate all possible Newick Tree permutations for a set of species given an outgroup?

For those who don't know what Newick tree format is, a good description is available at: https://en.wikipedia.org/wiki/Newick_format

I want to create all possible Newick Tree permutations for a set of species given an outgroup. The number of leaf nodes I expect to process are most likely 4, 5, or 6 leaf nodes.

Both "Soft" and "hard" polytomies are allowed. https://en.wikipedia.org/wiki/Polytomy#Soft_polytomies_vs._hard_polytomies https://biology.stackexchange.com/questions/23667/evidence-discussions-of-hard-polytomy

Shown below is the ideal output, with "E" set as the outgroup

Ideal Output:

((("A","B","C"),("D"),("E"));
((("A","B","D"),("C"),("E"));
((("A","C","D"),("B"),("E"));
((("B","C","D"),("A"),("E"));
((("A","B")("C","D"),("E"));
((("A","C")("B","D"),("E"));
((("B","C")("A","D"),("E"));
(("A","B","C","D"),("E"));
(((("A","B"),"C"),"D"),("E"));

However, any possible solutions that I've come with using itertools, specifically itertools.permutations, have come across the problem of equivalent output. The last idea I came up with involved the equivalent output that is shown below.

Equivalent output:

((("C","B","A"),("D"),("E"));
((("C","A","B"),("D"),("E"));
((("A","C","B"),("D"),("E"));

Here is the start of my idea for a solution. However, I'm not really sure what to about this problem besides itertools for now.

import itertools

def Newick_Permutation_Generator(list_of_species, name_of_outgroup)
    permutations_list =(list(itertools.permutations(["A","B","C","D","E"])))

    for given_permutation in permutations_list:
        process(given_permutation)

Newick_Permutation_Generator(["A","B","C","D","E"], "E")

2条回答
甜甜的少女心
2楼-- · 2019-04-07 23:35

This was a hard question! Here is the journey I took.

First observation is that the outgroup is always a single node tacked onto the end of the newick string. Let's call the rest of the species the ingroup and try to generate all the permutations of these. Then simply add the outgroup.

from itertools import permutations

def ingroup_generator(species, n):
    for perm in permutations(species, n):
        yield tuple([tuple(perm), tuple(s for s in species if s not in perm)])

def format_newick(s, outgroup=''):
    return '(' + ', '.join('({})'.format(', '.join(p)) for p in s) + ',({}));'.format(outgroup)

species = ["A","B","C","D","E"]
outgroup = "E"
ingroup = [s for s in species if s != outgroup]

itertools_newicks= []
for n in range(1, len(ingroup)):
    for p in ingroup_generator(ingroup, n):
        itertools_newicks.append(format_newick(p, outgroup))

for newick in itertools_newicks:
    print newick

This returns 40 newick strings:

((A), (B, C, D),(E));
((B), (A, C, D),(E));
((C), (A, B, D),(E));
((D), (A, B, C),(E));
((A, B), (C, D),(E));
((A, C), (B, D),(E));
((A, D), (B, C),(E));
((B, A), (C, D),(E));
((B, C), (A, D),(E));
((B, D), (A, C),(E));
((C, A), (B, D),(E));
((C, B), (A, D),(E));
((C, D), (A, B),(E));
((D, A), (B, C),(E));
((D, B), (A, C),(E));
((D, C), (A, B),(E));
((A, B, C), (D),(E));
((A, B, D), (C),(E));
((A, C, B), (D),(E));
((A, C, D), (B),(E));
((A, D, B), (C),(E));
((A, D, C), (B),(E));
((B, A, C), (D),(E));
((B, A, D), (C),(E));
((B, C, A), (D),(E));
((B, C, D), (A),(E));
((B, D, A), (C),(E));
((B, D, C), (A),(E));
((C, A, B), (D),(E));
((C, A, D), (B),(E));
((C, B, A), (D),(E));
((C, B, D), (A),(E));
((C, D, A), (B),(E));
((C, D, B), (A),(E));
((D, A, B), (C),(E));
((D, A, C), (B),(E));
((D, B, A), (C),(E));
((D, B, C), (A),(E));
((D, C, A), (B),(E));
((D, C, B), (A),(E));

Some of these are duplicates, but we will get to removing the duplicates later.

As bli noted in the comments, (((("A","B"),"C"),"D"),("E")); and its variants should also be considered valid solutions. The comments on BioStar pointed me in the right direction that this is the same as generating all the possible groupings of a binary tree. I found a nice Python implementation in this StackOverflow answer by rici:

# A very simple representation for Nodes. Leaves are anything which is not a Node.
class Node(object):
  def __init__(self, left, right):
    self.left = left
    self.right = right

  def __repr__(self):
    return '(%s, %s)' % (self.left, self.right)

# Given a tree and a label, yields every possible augmentation of the tree by
# adding a new node with the label as a child "above" some existing Node or Leaf.
def add_leaf(tree, label):
  yield Node(label, tree)
  if isinstance(tree, Node):
    for left in add_leaf(tree.left, label):
      yield Node(left, tree.right)
    for right in add_leaf(tree.right, label):
      yield Node(tree.left, right)

# Given a list of labels, yield each rooted, unordered full binary tree with
# the specified labels.
def enum_unordered(labels):
  if len(labels) == 1:
    yield labels[0]
  else:
    for tree in enum_unordered(labels[1:]):
      for new_tree in add_leaf(tree, labels[0]):
        yield new_tree

Then,

enum_newicks= []
for t in enum_unordered(ingroup):
    enum_newicks.append('({},({}));'.format(t, outgroup))

for newick in enum_newicks:
    print newick

produces the following 15 newicks:

((A, (B, (C, D))),(E));
(((A, B), (C, D)),(E));
((B, (A, (C, D))),(E));
((B, ((A, C), D)),(E));
((B, (C, (A, D))),(E));
((A, ((B, C), D)),(E));
(((A, (B, C)), D),(E));
((((A, B), C), D),(E));
(((B, (A, C)), D),(E));
(((B, C), (A, D)),(E));
((A, (C, (B, D))),(E));
(((A, C), (B, D)),(E));
((C, (A, (B, D))),(E));
((C, ((A, B), D)),(E));
((C, (B, (A, D))),(E));

So now we already have 40 + 15 = 55 possible newick strings and we have to remove the duplicates.

I first dead end that I tried was to create a canonical representation of each newick string so that I could use these as keys in a dictionary. The idea was to recursively sort the strings in all the nodes. But first I had to capture all the (nested) nodes. I couldn't use regular expressions, because nested structures are by definition not regular.

So I used the pyparsing package and come up with this:

from pyparsing import nestedExpr 

def sort_newick(t):
    if isinstance(t, str):
        return sorted(t)
    else:
        if all(isinstance(c, str) for c in t):
            return sorted(t)
        if all(isinstance(l, list) for l in t):
            return [sort_newick(l) for l in sorted(t, key=lambda k: sorted(k))]
        else:
            return [sort_newick(l) for l in t]


def canonical_newick(n):
    n = n.replace(',', '')
    p = nestedExpr().parseString(n).asList()
    s = sort_newick(p)
    return str(s)

This gave for

from collections import defaultdict

all_newicks = itertools_newicks + enum_newicks

d = defaultdict(list)
for newick in all_newicks:
    d[canonical_newick(newick)].append(newick)

for canonical, newicks in d.items():
    print canonical
    for newick in newicks:
        print '\t', newick
    print

A dictionary with 22 keys:

[[[['A'], [['C'], ['B', 'D']]], ['E']]]
    ((A, (C, (B, D))),(E));

[[[['B'], [['A'], ['C', 'D']]], ['E']]]
    ((B, (A, (C, D))),(E));

[[[['B'], [['A', 'C'], ['D']]], ['E']]]
    ((B, ((A, C), D)),(E));

[[['A', 'C', 'D'], ['B'], ['E']]]
    ((B), (A, C, D),(E));
    ((A, C, D), (B),(E));
    ((A, D, C), (B),(E));
    ((C, A, D), (B),(E));
    ((C, D, A), (B),(E));
    ((D, A, C), (B),(E));
    ((D, C, A), (B),(E));

[[['A', 'B'], ['C', 'D'], ['E']]]
    ((A, B), (C, D),(E));
    ((B, A), (C, D),(E));
    ((C, D), (A, B),(E));
    ((D, C), (A, B),(E));

[[[[['A'], ['B', 'C']], ['D']], ['E']]]
    (((A, (B, C)), D),(E));

[[[['A', 'C'], ['B', 'D']], ['E']]]
    (((A, C), (B, D)),(E));

[[['A'], ['B', 'C', 'D'], ['E']]]
    ((A), (B, C, D),(E));
    ((B, C, D), (A),(E));
    ((B, D, C), (A),(E));
    ((C, B, D), (A),(E));
    ((C, D, B), (A),(E));
    ((D, B, C), (A),(E));
    ((D, C, B), (A),(E));

[[[['A', 'D'], ['B', 'C']], ['E']]]
    (((B, C), (A, D)),(E));

[[['A', 'B', 'C'], ['D'], ['E']]]
    ((D), (A, B, C),(E));
    ((A, B, C), (D),(E));
    ((A, C, B), (D),(E));
    ((B, A, C), (D),(E));
    ((B, C, A), (D),(E));
    ((C, A, B), (D),(E));
    ((C, B, A), (D),(E));

[[['A', 'C'], ['B', 'D'], ['E']]]
    ((A, C), (B, D),(E));
    ((B, D), (A, C),(E));
    ((C, A), (B, D),(E));
    ((D, B), (A, C),(E));

[[['A', 'B', 'D'], ['C'], ['E']]]
    ((C), (A, B, D),(E));
    ((A, B, D), (C),(E));
    ((A, D, B), (C),(E));
    ((B, A, D), (C),(E));
    ((B, D, A), (C),(E));
    ((D, A, B), (C),(E));
    ((D, B, A), (C),(E));

[[[['A'], [['B'], ['C', 'D']]], ['E']]]
    ((A, (B, (C, D))),(E));

[[[[['A', 'B'], ['C']], ['D']], ['E']]]
    ((((A, B), C), D),(E));

[[[[['B'], ['A', 'C']], ['D']], ['E']]]
    (((B, (A, C)), D),(E));

[[[['C'], [['B'], ['A', 'D']]], ['E']]]
    ((C, (B, (A, D))),(E));

[[[['C'], [['A', 'B'], ['D']]], ['E']]]
    ((C, ((A, B), D)),(E));

[[[['A'], [['B', 'C'], ['D']]], ['E']]]
    ((A, ((B, C), D)),(E));

[[[['A', 'B'], ['C', 'D']], ['E']]]
    (((A, B), (C, D)),(E));

[[[['B'], [['C'], ['A', 'D']]], ['E']]]
    ((B, (C, (A, D))),(E));

[[[['C'], [['A'], ['B', 'D']]], ['E']]]
    ((C, (A, (B, D))),(E));

[[['A', 'D'], ['B', 'C'], ['E']]]
    ((A, D), (B, C),(E));
    ((B, C), (A, D),(E));
    ((C, B), (A, D),(E));
    ((D, A), (B, C),(E));

But closer inspection revealed some problems. Let's look for example at the newicks '(((A, B), (C, D)),(E)); and ((D, C), (A, B),(E));. In our dictionary d they have a different canonical key, respectively [[[['A', 'B'], ['C', 'D']], ['E']]] and [[['A', 'B'], ['C', 'D'], ['E']]]. But in fact, these are duplicate trees. We can confirm this by looking at the Robinson-Foulds distance between two trees. If it is zero, the trees are identical.

We use the robinson_foulds function from the ete3 toolkit package

from ete3 import Tree

tree1 = Tree('(((A, B), (C, D)),(E));')
tree2 = Tree('((D, C), (A, B),(E));')

rf, max_parts, common_attrs, edges1, edges2, discard_t1, discard_t2 = tree1.robinson_foulds(tree2, unrooted_trees=True)
    print rf # returns 0

OK, so Robinson-Foulds is a better way of checking for newick tree equality then my canonical tree approach. Let's wrap all newick strings in a custom MyTree object where equality is defined as having a Robinson-Foulds distance of zero:

class MyTree(Tree):

    def __init__(self, *args, **kwargs):
        super(MyTree, self).__init__(*args, **kwargs)

    def __eq__(self, other):
        rf = self.robinson_foulds(other, unrooted_trees=True)
        return not bool(rf[0])

trees = [MyTree(newick) for newick in all_newicks]

It would have been ideal if we could also define a __hash__() function that returns the same value for duplicate trees, then set(trees) would automatically remove all the duplicates.

Unfortunately, I haven't been able to find a good way to define __hash__(), but with __eq__ in place, I could make use of index():

unique_trees = [trees[i] for i in range(len(trees)) if i == trees.index(trees[i])]
unique_newicks = [tree.write(format=9) for tree in unique_trees]
for unique_newick in unique_newicks:
    print unique_newick

So, here we are at the end of our journey. I can't fully provide proof that this is the correct solution, but I am pretty confident that the following 19 newicks are all the possible distinct permutations:

((A),(B,C,D),(E));
((B),(A,C,D),(E));
((C),(A,B,D),(E));
((D),(A,B,C),(E));
((A,B),(C,D),(E));
((A,C),(B,D),(E));
((A,D),(B,C),(E));
((A,(B,(C,D))),(E));
((B,(A,(C,D))),(E));
((B,((A,C),D)),(E));
((B,(C,(A,D))),(E));
((A,((B,C),D)),(E));
(((A,(B,C)),D),(E));
((((A,B),C),D),(E));
(((B,(A,C)),D),(E));
((A,(C,(B,D))),(E));
((C,(A,(B,D))),(E));
((C,((A,B),D)),(E));
((C,(B,(A,D))),(E));

If we pairwise compare every newick to all other newicks, we get confirmation that there are no more duplicates in this list

from itertools import product

for n1, n2 in product(unique_newicks, repeat=2):
    if n1 != n2:
        mt1 = MyTree(n1)
        mt2 = MyTree(n2)
        assert mt1 != mt2
查看更多
3楼-- · 2019-04-07 23:55

A tree as a recursive set of sets of leaves

Let's set aside the newick representation for the moment, and think of a possible python representation of the problem.

A rooted tree can be viewed as a recursive hierarchy of sets of (sets of (sets of ...)) leaves. Sets are unordered, which is quite adapted to describe clades in a tree: {{{"A", "B"}, {"C", "D"}}, "E"} should be the same thing as {{{"C", "D"}, {"B", "A"}}, "E"}.

If we consider the initial set of leaves {"A", "B", "C", "D", "E"}, the trees with "E" as outgroup are the set of sets in the form {tree, "E"} where trees are taken from the set of trees that can be built from the set of leaves {"A", "B", "C", "D"}. We could try to write a recursive trees function to generate this set of trees, and our total set of trees would be expressed as follows:

{{tree, "E"} for tree in trees({"A", "B", "C", "D"})}

(Here, I use the set comprehension notation.)

Actually, python doesn't allow sets of sets, because the elements of a set must be "hashable" (that is, python must be able to compute some "hash" values of objects to be able to check whether they belong or not to the set). It happens that python sets do not have this property. Fortunately, we can use a similar data structure named frozenset, which behaves quite like a set, but cannot be modified and is "hashable". Therefore, our set of trees would be:

all_trees = frozenset(
    {frozenset({tree, "E"}) for tree in trees({"A", "B", "C", "D"})})

Implementing the trees function

Now let's focus on the trees function.

For each possible partition (decomposition into a set of disjoint subsets, including all elements) of the set of leaves, we need to find all possible trees (through a recursive call) for each part of the partition. For a given partition, we will then make a tree for each possible combination of subtrees taken across its parts.

For instance, if a partition is {"A", {"B", "C", "D"}}, we will consider all possible trees that can be made from part "A" (actually, just the leaf "A" itself), and all possible trees that can be made from part {"B", "C", "D"} (that is, trees({"B", "C", "D"})). Then, the possible trees for this partition will be obtained by taking all possible pairs where one element comes from just "A", and the other from trees({"B", "C", "D"}).

This can be generalized for partitions with more than two parts, and the product function from itertools seems to be useful here.

Therefore, we need a way to generate the possible partitions of a set of leaves.

Generating partitions of a set

Here I made a partitions_of_set function adapted from this solution:

# According to https://stackoverflow.com/a/30134039/1878788:
# The problem is solved recursively:
# If you already have a partition of n-1 elements, how do you use it to partition n elements?
# Either place the n'th element in one of the existing subsets, or add it as a new, singleton subset.
def partitions_of_set(s):
    if len(s) == 1:
        yield frozenset(s)
        return
    # Extract one element from the set
    # https://stackoverflow.com/a/43804050/1878788
    elem, *_ = s
    rest = frozenset(s - {elem})
    for partition in partitions_of_set(rest):
        for subset in partition:
            # Insert the element in the subset
            try:
                augmented_subset = frozenset(subset | frozenset({elem}))
            except TypeError:
                # subset is actually an atomic element
                augmented_subset = frozenset({subset} | frozenset({elem}))
            yield frozenset({augmented_subset}) | (partition - {subset})
        # Case with the element in its own extra subset
        yield frozenset({elem}) | partition

To check the obtained partitions, we make a function to make them easier to display (that will also be useful to make a newick representation of the trees later):

def print_set(f):
    if type(f) not in (set, frozenset):
        return str(f)
    return "(" + ",".join(sorted(map(print_set, f))) + ")"

We test that the partitioning works:

for partition in partitions_of_set({"A", "B", "C", "D"}):
    print(len(partition), print_set(partition))

Output:

1 ((A,B,C,D))
2 ((A,B,D),C)
2 ((A,C),(B,D))
2 ((B,C,D),A)
3 ((B,D),A,C)
2 ((A,B,C),D)
2 ((A,B),(C,D))
3 ((A,B),C,D)
2 ((A,D),(B,C))
2 ((A,C,D),B)
3 ((A,D),B,C)
3 ((A,C),B,D)
3 ((B,C),A,D)
3 ((C,D),A,B)
4 (A,B,C,D)

Actual code of the trees function

Now we can write the tree function:

from itertools import product
def trees(leaves):
    if type(leaves) not in (set, frozenset):
        # It actually is a single leaf
        yield leaves
        # Don't try to yield any more trees
        return
    # Otherwise, we will have to consider all the possible
    # partitions of the set of leaves, and for each partition,
    # construct the possible trees for each part
    for partition in partitions_of_set(leaves):
        # We need to skip the case where the partition
        # has only one subset (the initial set itself),
        # otherwise we will try to build an infinite
        # succession of nodes with just one subtree
        if len(partition) == 1:
            part, *_ = partition
            # Just to be sure the assumption is correct
            assert part == leaves
            continue
        # We recursively apply *tree* to each part
        # and obtain the possible trees by making
        # the product of the sets of possible subtrees.
        for subtree in product(*map(trees, partition)):
            # Using a frozenset guarantees
            # that there will be no duplicates
            yield frozenset(subtree)

Testing it:

all_trees = frozenset(
    {frozenset({tree, "E"}) for tree in trees({"A", "B", "C", "D"})})

for tree in all_trees:
    print(print_set(tree) + ";")

Output:

(((B,C),A,D),E);
((((A,B),D),C),E);
((((B,D),A),C),E);
((((C,D),A),B),E);
(((A,D),B,C),E);
((A,B,C,D),E);
((((B,D),C),A),E);
(((A,B,C),D),E);
((((A,C),B),D),E);
((((C,D),B),A),E);
((((B,C),A),D),E);
(((A,B),C,D),E);
(((A,C),(B,D)),E);
(((B,D),A,C),E);
(((C,D),A,B),E);
((((A,B),C),D),E);
((((A,C),D),B),E);
(((A,C,D),B),E);
(((A,D),(B,C)),E);
((((A,D),C),B),E);
((((B,C),D),A),E);
(((A,B),(C,D)),E);
(((A,B,D),C),E);
((((A,D),B),C),E);
(((A,C),B,D),E);
(((B,C,D),A),E);

I hope the result is correct.

This approach was a bit tricky to get right. It took me some time to figure out how to avoid the infinite recursion (This happens when the partition is {{"A", "B", "C", "D"}}).

查看更多
登录 后发表回答