Creating a tree/deeply nested dict from an indente

2019-01-17 17:59发布

问题:

Basically, I want to iterate through a file and put the contents of each line into a deeply nested dict, the structure of which is defined by the amount of whitespace at the start of each line.

Essentially the aim is to take something like this:

a
    b
        c
    d
        e

And turn it into something like this:

{"a":{"b":"c","d":"e"}}

Or this:

apple
    colours
        red
        yellow
        green
    type
        granny smith
    price
        0.10

into this:

{"apple":{"colours":["red","yellow","green"],"type":"granny smith","price":0.10}

So that I can send it to Python's JSON module and make some JSON.

At the moment I'm trying to make a dict and a list in steps like such:

  1. {"a":""} ["a"]
  2. {"a":"b"} ["a"]
  3. {"a":{"b":"c"}} ["a","b"]
  4. {"a":{"b":{"c":"d"}}}} ["a","b","c"]
  5. {"a":{"b":{"c":"d"},"e":""}} ["a","e"]
  6. {"a":{"b":{"c":"d"},"e":"f"}} ["a","e"]
  7. {"a":{"b":{"c":"d"},"e":{"f":"g"}}} ["a","e","f"]

etc.

The list acts like 'breadcrumbs' showing where I last put in a dict.

To do this I need a way to iterate through the list and generate something like dict["a"]["e"]["f"] to get at that last dict. I've had a look at the AutoVivification class that someone has made which looks very useful however I'm really unsure of:

  1. Whether I'm using the right data structure for this (I'm planning to send it to the JSON library to create a JSON object)
  2. How to use AutoVivification in this instance
  3. Whether there's a better way in general to approach this problem.

I came up with the following function but it doesn't work:

def get_nested(dict,array,i):
if i != None:
    i += 1
    if array[i] in dict:
        return get_nested(dict[array[i]],array)
    else:
        return dict
else:
    i = 0
    return get_nested(dict[array[i]],array)

Would appreciate help!

(The rest of my extremely incomplete code is here:)

#Import relevant libraries
import codecs
import sys

#Functions
def stripped(str):
    if tab_spaced:
        return str.lstrip('\t').rstrip('\n\r')
    else:
        return str.lstrip().rstrip('\n\r')

def current_ws():
    if whitespacing == 0 or not tab_spaced:
        return len(line) - len(line.lstrip())
    if tab_spaced:
        return len(line) - len(line.lstrip('\t\n\r'))

def get_nested(adict,anarray,i):
    if i != None:
        i += 1
        if anarray[i] in adict:
            return get_nested(adict[anarray[i]],anarray)
        else:
            return adict
    else:
        i = 0
        return get_nested(adict[anarray[i]],anarray)

#initialise variables
jsondict = {}
unclosed_tags = []
debug = []

vividfilename = 'simple.vivid'
# vividfilename = sys.argv[1]
if len(sys.argv)>2:
    jsfilename = sys.argv[2]
else:
    jsfilename = vividfilename.split('.')[0] + '.json'

whitespacing = 0
whitespace_array = [0,0]
tab_spaced = False

#open the file
with codecs.open(vividfilename,'rU', "utf-8-sig") as vividfile:
    for line in vividfile:
        #work out how many whitespaces at start
        whitespace_array.append(current_ws())

        #For first line with whitespace, work out the whitespacing (eg tab vs 4-space)
        if whitespacing == 0 and whitespace_array[-1] > 0:
            whitespacing = whitespace_array[-1]
            if line[0] == '\t':
                tab_spaced = True

        #strip out whitespace at start and end
        stripped_line = stripped(line)

        if whitespace_array[-1] == 0:
            jsondict[stripped_line] = ""
            unclosed_tags.append(stripped_line)

        if whitespace_array[-2] < whitespace_array[-1]:
            oldnested = get_nested(jsondict,whitespace_array,None)
            print oldnested
            # jsondict.pop(unclosed_tags[-1])
            # jsondict[unclosed_tags[-1]]={stripped_line:""}
            # unclosed_tags.append(stripped_line)

        print jsondict
        print unclosed_tags

print jsondict
print unclosed_tags

回答1:

Here is a recursive solution. First, transform the input in the following way.

Input:

person:
    address:
        street1: 123 Bar St
        street2: 
        city: Madison
        state: WI
        zip: 55555
    web:
        email: boo@baz.com

First-step output:

[{'name':'person','value':'','level':0},
 {'name':'address','value':'','level':1},
 {'name':'street1','value':'123 Bar St','level':2},
 {'name':'street2','value':'','level':2},
 {'name':'city','value':'Madison','level':2},
 {'name':'state','value':'WI','level':2},
 {'name':'zip','value':55555,'level':2},
 {'name':'web','value':'','level':1},
 {'name':'email','value':'boo@baz.com','level':2}]

This is easy to accomplish with split(':') and by counting the number of leading tabs:

def tab_level(astr):
    """Count number of leading tabs in a string
    """
    return len(astr)- len(astr.lstrip('\t'))

Then feed the first-step output into the following function:

def ttree_to_json(ttree,level=0):
    result = {}
    for i in range(0,len(ttree)):
        cn = ttree[i]
        try:
            nn  = ttree[i+1]
        except:
            nn = {'level':-1}

        # Edge cases
        if cn['level']>level:
            continue
        if cn['level']<level:
            return result

        # Recursion
        if nn['level']==level:
            dict_insert_or_append(result,cn['name'],cn['value'])
        elif nn['level']>level:
            rr = ttree_to_json(ttree[i+1:], level=nn['level'])
            dict_insert_or_append(result,cn['name'],rr)
        else:
            dict_insert_or_append(result,cn['name'],cn['value'])
            return result
    return result

where:

def dict_insert_or_append(adict,key,val):
    """Insert a value in dict at key if one does not exist
    Otherwise, convert value to list and append
    """
    if key in adict:
        if type(adict[key]) != list:
            adict[key] = [adict[key]]
        adict[key].append(val)
    else:
        adict[key] = val


回答2:

The following code will take a block-indented file and convert into an XML tree; this:

foo
bar
baz
  ban
  bal

...becomes:

<cmd>foo</cmd>
<cmd>bar</cmd>
<block>
  <name>baz</name>
  <cmd>ban</cmd>
  <cmd>bal</cmd>
</block>

The basic technique is:

  1. Set indent to 0
  2. For each line, get the indent
  3. If > current, step down and save current block/ident on a stack
  4. If == current, append to current block
  5. If < current, pop from the stack until you get to the matching indent

So:

from lxml import builder
C = builder.ElementMaker()

def indent(line):
    strip = line.lstrip()
    return len(line) - len(strip), strip

def parse_blockcfg(data):
    top = current_block = C.config()
    stack = []
    current_indent = 0

    lines = data.split('\n')
    while lines:
        line = lines.pop(0)
        i, line = indent(line)

        if i==current_indent:
            pass

        elif i > current_indent:
            # we've gone down a level, convert the <cmd> to a block
            # and then save the current ident and block to the stack
            prev.tag = 'block'
            prev.append(C.name(prev.text))
            prev.text = None
            stack.insert(0, (current_indent, current_block))
            current_indent = i
            current_block = prev

        elif i < current_indent:
            # we've gone up one or more levels, pop the stack
            # until we find out which level and return to it
            found = False
            while stack:
                parent_indent, parent_block = stack.pop(0)
                if parent_indent==i:
                    found = True
                    break
            if not found:
                raise Exception('indent not found in parent stack')
            current_indent = i
            current_block = parent_block

        prev = C.cmd(line)
        current_block.append(prev)

    return top


回答3:

Here is an object oriented approach based on a composite structure of nested Node objects.

Input:

indented_text = \
"""
apple
    colours
        red
        yellow
        green
    type
        granny smith
    price
        0.10
"""

a Node class

class Node:
    def __init__(self, indented_line):
        self.children = []
        self.level = len(indented_line) - len(indented_line.lstrip())
        self.text = indented_line.strip()

    def add_children(self, nodes):
        childlevel = nodes[0].level
        while nodes:
            node = nodes.pop(0)
            if node.level == childlevel: # add node as a child
                self.children.append(node)
            elif node.level > childlevel: # add nodes as grandchildren of the last child
                nodes.insert(0,node)
                self.children[-1].add_children(nodes)
            elif node.level <= self.level: # this node is a sibling, no more children
                nodes.insert(0,node)
                return

    def as_dict(self):
        if len(self.children) > 1:
            return {self.text: [node.as_dict() for node in self.children]}
        elif len(self.children) == 1:
            return {self.text: self.children[0].as_dict()}
        else:
            return self.text

To parse the text, first create a root node. Then, remove empty lines from the text, and create a Node instance for every line, pass this to the add_children method of the root node.

root = Node('root')
root.add_children([Node(line) for line in indented_text.splitlines() if line.strip()])
d = root.as_dict()['root']
print(d)

result:

{'apple': [
  {'colours': ['red', 'yellow', 'green']},
  {'type': 'granny smith'},
  {'price': '0.10'}]
}

I think that it should be possible to do it in one step, where you simply call the constructor of Node once, with the indented text as an argument.



回答4:

First of all, don't use array and dict as variable names because they're reserved words in Python and reusing them may end up in all sorts of chaos.

OK so if I get you correctly, you have a tree given in a text file, with parenthood indicated by indentations, and you want to recover the actual tree structure. Right?

Does the following look like a valid outline? Because I have trouble putting your current code into context.

result = {}
last_indentation = 0
for l in f.xreadlines():
   (c, i) = parse(l) # create parse to return character and indentation
   if i==last_indentation:
   # sibling to last
   elif i>last_indentation:
   # child to last
   else:
   # end of children, back to a higher level

OK then your list are the current parents, that's in fact right - but I'd keep them pointed to the dictionary you've created, not the literal letter

just starting some stuff here

result = {}
parents = {}
last_indentation = 1 # start with 1 so 0 is the root of tree
parents[0] = result
for l in f.xreadlines():
   (c, i) = parse(l) # create parse to return character and indentation
   if i==last_indentation:
       new_el = {}
       parents[i-1][c] = new_el
       parents[i] = new_el
   elif i>last_indentation:
   # child to last
   else:
   # end of children, back to a higher level