Writing nested dictionary (forest) of a huge depth

2019-07-13 12:34发布

Continued to my older question: Writing nested dictionary (forest) of a huge depth to a text file

Now I want to write the forest traversal in the BFS style: I have a huge depth dictionary that represents forest (many non-binary trees) which I want to process the forest and create a text file with the sequences of (father, son) relations from the forest, i.e. given the dictionary:

{'a': {'b': {'c': {'x': {}}, 'd': {'p': {}}}, 'g': {}, 'f': {}},
 't': {'r': {'o': {}}, 'y': {}}}

the generated text file will look like:

(ROOT,b) (ROOT,g) (ROOT,f) (b,c) (b,d) (c,x) (d,p) \n
(ROOT,r) (ROOT,y) (r,o) \n

Note that I replaces all the roots in the forest with the word "ROOT".

Here is a simple visualization of the forest: forest vosualization

The nested dictionary is big and iterating over it recursively makes a memory run-time error, so a "Generator style" solution as in the link at the beginning of this question would be the best.

3条回答
走好不送
2楼-- · 2019-07-13 13:07

It is easiest to traverse the structure recursively with a generator:

def flatten_forest(forest, write=True):
  def flatten(d, seen = None):
    for a, b in d.items():
      if seen is None:
       yield ('ROOT', a)
      else:
        yield (seen, a)
      if b:
        yield from flatten(b, a)
  if write:
    with open('full_flattened_tree.txt', 'a') as f:
      f.write(' '.join(map(str, flatten(forest)))+'\n')

data = {'a': {'b': {'c': {'x': {}}, 'd': {'p': {}}}, 'g': {}, 'f': {}}, 't': {'r': {'o': {}}, 'y': {}}}
for i in data.values():
  flatten_forest(i)

File output:

('ROOT', 'b') ('b', 'c') ('c', 'x') ('b', 'd') ('d', 'p') ('ROOT', 'g') ('ROOT', 'f')
('ROOT', 'r') ('r', 'o') ('ROOT', 'y')

This will work on large dictionaries:

import random, string, time
def create_structure(_len, _depth = 5, _count = 0):
 return {string.ascii_lowercase[i]:{} if _depth == _count else create_structure(random.randint(1, 26), _count = _count + 1) for i in range(_len)}

d = create_structure(26)
c = time.time()
flatten_forest(d, write=True)
print(time.time()-c)

Output:

11.871491193771362
查看更多
聊天终结者
3楼-- · 2019-07-13 13:14
d = {'a': {'b': {'c': {'x': {}}, 'd': {'p': {}}}, 'g': {}, 'f': {}}, 't': {'r': {'o': {}}, 'y': {}}}
with open('file', 'w') as f:
    for r, s in d.items():
        q = []
        p = r
        while True:
            for k, v in s.items():
                f.write('(%s,%s) ' % ('ROOT' if p == r else p, k))
                if v:
                    q.append((k, v))
            if not q:
                break
            p, s = q.pop(0)
        f.write('\n')

This outputs:

(ROOT,b) (ROOT,g) (ROOT,f) (b,c) (b,d) (c,x) (d,p) 
(ROOT,r) (ROOT,y) (r,o) 
查看更多
Anthone
4楼-- · 2019-07-13 13:14

To perform breadth-first-search, we must keep a list of the current working nodes and the trees below them - I chose to store these in a tuple.

For instance, when we are working at the depth of the c and d nodes, this list of trees will be:

[('c': {'x': {}}), ('d': {'p': {}})]

Now while there are still trees below us (while len(trees):), we need to step down to the layer below in our tree.

The fist step is obviously to reset the trees list since we will be generating the next layer.

We then iterate over our list of trees and, for each tree, we iterate over its children.

So taking the above example, on the first iteration, the node would be 'c' and the children would be: {'x': {}} and we now want to iterate over the children. So on the first iteration of that children loop, the first child node will be 'x' and its children (c's child's children) are empty: {}.

Now, in this scope (the node's child's), if the child has children we want to add the child and its children (again, as a tuple) to the list of trees.

So to give an example where are there are children, when the current node is b, then one of its children is c and since c has children, a tuple of (c, c's children) is appended to the trees list for the next layer.

Finally, no matter whether this child had children or not, we want to the current line in the file the link between us and them. This is (node, child_node).

And that's almost it. Of course, when we finish a whole tree, we need to write a new-line to the file.

The only annoying detail is the issue of spaces between the tuples written to the file. If we always concatenated a space to the end of each tuple, we would end up with a stray space on the end of each line, as seen below which isn't ideal.

(ROOT, a)S(a,b)S

(where S represents a space)

So to redeem this, we will always concatenate a space before each tuple so long as we are not the first on a new line (line_first). For this to work, at the start of each tree (line), we set the line_first flag to True, but then in the code, we immediately set it to False on the first iteration (but skip writing the space), otherwise (future tuples) we write a space before.

And that's it. Here's the completed code:

the_tree = {'a': {'b': {'c': {'x': {}}, 'd': {'p': {}}}, 'g': {}, 'f': {}},
            't': {'r': {'o': {}}, 'y': {}}}

with open('the_file', 'w') as file:
    for tree in the_tree.values():
        line_first = True
        trees = [('ROOT', tree)]
        while len(trees):
            new_trees = []
            for node, children in trees:
                for child_node, child_children in children.items():
                    if child_children:
                        new_trees.append((child_node, child_children))
                    if line_first: line_first = False
                    else: file.write(' ')
                    file.write(f'({node}, {child_node})')
            trees = new_trees
        file.write('\n')

warning: uses f-strings which were introduced in version 3.6!


And it produces the intended output:

(ROOT, b) (ROOT, g) (ROOT, f) (b, c) (b, d) (c, x) (d, p)
(ROOT, r) (ROOT, y) (r, o)
查看更多
登录 后发表回答