可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效,请关闭广告屏蔽插件后再试):
问题:
Basically, I want to iterate through a file and put the contents of each line into a deeply nested dict, the structure of which is defined by the amount of whitespace at the start of each line.
Essentially the aim is to take something like this:
a
b
c
d
e
And turn it into something like this:
{"a":{"b":"c","d":"e"}}
Or this:
apple
colours
red
yellow
green
type
granny smith
price
0.10
into this:
{"apple":{"colours":["red","yellow","green"],"type":"granny smith","price":0.10}
So that I can send it to Python's JSON module and make some JSON.
At the moment I'm trying to make a dict and a list in steps like such:
{"a":""} ["a"]
{"a":"b"} ["a"]
{"a":{"b":"c"}} ["a","b"]
{"a":{"b":{"c":"d"}}}} ["a","b","c"]
{"a":{"b":{"c":"d"},"e":""}} ["a","e"]
{"a":{"b":{"c":"d"},"e":"f"}} ["a","e"]
{"a":{"b":{"c":"d"},"e":{"f":"g"}}} ["a","e","f"]
etc.
The list acts like 'breadcrumbs' showing where I last put in a dict.
To do this I need a way to iterate through the list and generate something like dict["a"]["e"]["f"]
to get at that last dict. I've had a look at the AutoVivification class that someone has made which looks very useful however I'm really unsure of:
- Whether I'm using the right data structure for this (I'm planning to send it to the JSON library to create a JSON object)
- How to use AutoVivification in this instance
- Whether there's a better way in general to approach this problem.
I came up with the following function but it doesn't work:
def get_nested(dict,array,i):
if i != None:
i += 1
if array[i] in dict:
return get_nested(dict[array[i]],array)
else:
return dict
else:
i = 0
return get_nested(dict[array[i]],array)
Would appreciate help!
(The rest of my extremely incomplete code is here:)
#Import relevant libraries
import codecs
import sys
#Functions
def stripped(str):
if tab_spaced:
return str.lstrip('\t').rstrip('\n\r')
else:
return str.lstrip().rstrip('\n\r')
def current_ws():
if whitespacing == 0 or not tab_spaced:
return len(line) - len(line.lstrip())
if tab_spaced:
return len(line) - len(line.lstrip('\t\n\r'))
def get_nested(adict,anarray,i):
if i != None:
i += 1
if anarray[i] in adict:
return get_nested(adict[anarray[i]],anarray)
else:
return adict
else:
i = 0
return get_nested(adict[anarray[i]],anarray)
#initialise variables
jsondict = {}
unclosed_tags = []
debug = []
vividfilename = 'simple.vivid'
# vividfilename = sys.argv[1]
if len(sys.argv)>2:
jsfilename = sys.argv[2]
else:
jsfilename = vividfilename.split('.')[0] + '.json'
whitespacing = 0
whitespace_array = [0,0]
tab_spaced = False
#open the file
with codecs.open(vividfilename,'rU', "utf-8-sig") as vividfile:
for line in vividfile:
#work out how many whitespaces at start
whitespace_array.append(current_ws())
#For first line with whitespace, work out the whitespacing (eg tab vs 4-space)
if whitespacing == 0 and whitespace_array[-1] > 0:
whitespacing = whitespace_array[-1]
if line[0] == '\t':
tab_spaced = True
#strip out whitespace at start and end
stripped_line = stripped(line)
if whitespace_array[-1] == 0:
jsondict[stripped_line] = ""
unclosed_tags.append(stripped_line)
if whitespace_array[-2] < whitespace_array[-1]:
oldnested = get_nested(jsondict,whitespace_array,None)
print oldnested
# jsondict.pop(unclosed_tags[-1])
# jsondict[unclosed_tags[-1]]={stripped_line:""}
# unclosed_tags.append(stripped_line)
print jsondict
print unclosed_tags
print jsondict
print unclosed_tags
回答1:
Here is a recursive solution. First, transform the input in the following way.
Input:
person:
address:
street1: 123 Bar St
street2:
city: Madison
state: WI
zip: 55555
web:
email: boo@baz.com
First-step output:
[{'name':'person','value':'','level':0},
{'name':'address','value':'','level':1},
{'name':'street1','value':'123 Bar St','level':2},
{'name':'street2','value':'','level':2},
{'name':'city','value':'Madison','level':2},
{'name':'state','value':'WI','level':2},
{'name':'zip','value':55555,'level':2},
{'name':'web','value':'','level':1},
{'name':'email','value':'boo@baz.com','level':2}]
This is easy to accomplish with split(':')
and by counting the number of leading tabs:
def tab_level(astr):
"""Count number of leading tabs in a string
"""
return len(astr)- len(astr.lstrip('\t'))
Then feed the first-step output into the following function:
def ttree_to_json(ttree,level=0):
result = {}
for i in range(0,len(ttree)):
cn = ttree[i]
try:
nn = ttree[i+1]
except:
nn = {'level':-1}
# Edge cases
if cn['level']>level:
continue
if cn['level']<level:
return result
# Recursion
if nn['level']==level:
dict_insert_or_append(result,cn['name'],cn['value'])
elif nn['level']>level:
rr = ttree_to_json(ttree[i+1:], level=nn['level'])
dict_insert_or_append(result,cn['name'],rr)
else:
dict_insert_or_append(result,cn['name'],cn['value'])
return result
return result
where:
def dict_insert_or_append(adict,key,val):
"""Insert a value in dict at key if one does not exist
Otherwise, convert value to list and append
"""
if key in adict:
if type(adict[key]) != list:
adict[key] = [adict[key]]
adict[key].append(val)
else:
adict[key] = val
回答2:
The following code will take a block-indented file and convert into an XML tree; this:
foo
bar
baz
ban
bal
...becomes:
<cmd>foo</cmd>
<cmd>bar</cmd>
<block>
<name>baz</name>
<cmd>ban</cmd>
<cmd>bal</cmd>
</block>
The basic technique is:
- Set indent to 0
- For each line, get the indent
- If > current, step down and save current block/ident on a stack
- If == current, append to current block
- If < current, pop from the stack until you get to the matching indent
So:
from lxml import builder
C = builder.ElementMaker()
def indent(line):
strip = line.lstrip()
return len(line) - len(strip), strip
def parse_blockcfg(data):
top = current_block = C.config()
stack = []
current_indent = 0
lines = data.split('\n')
while lines:
line = lines.pop(0)
i, line = indent(line)
if i==current_indent:
pass
elif i > current_indent:
# we've gone down a level, convert the <cmd> to a block
# and then save the current ident and block to the stack
prev.tag = 'block'
prev.append(C.name(prev.text))
prev.text = None
stack.insert(0, (current_indent, current_block))
current_indent = i
current_block = prev
elif i < current_indent:
# we've gone up one or more levels, pop the stack
# until we find out which level and return to it
found = False
while stack:
parent_indent, parent_block = stack.pop(0)
if parent_indent==i:
found = True
break
if not found:
raise Exception('indent not found in parent stack')
current_indent = i
current_block = parent_block
prev = C.cmd(line)
current_block.append(prev)
return top
回答3:
Here is an object oriented approach based on a composite structure of nested Node
objects.
Input:
indented_text = \
"""
apple
colours
red
yellow
green
type
granny smith
price
0.10
"""
a Node class
class Node:
def __init__(self, indented_line):
self.children = []
self.level = len(indented_line) - len(indented_line.lstrip())
self.text = indented_line.strip()
def add_children(self, nodes):
childlevel = nodes[0].level
while nodes:
node = nodes.pop(0)
if node.level == childlevel: # add node as a child
self.children.append(node)
elif node.level > childlevel: # add nodes as grandchildren of the last child
nodes.insert(0,node)
self.children[-1].add_children(nodes)
elif node.level <= self.level: # this node is a sibling, no more children
nodes.insert(0,node)
return
def as_dict(self):
if len(self.children) > 1:
return {self.text: [node.as_dict() for node in self.children]}
elif len(self.children) == 1:
return {self.text: self.children[0].as_dict()}
else:
return self.text
To parse the text, first create a root node.
Then, remove empty lines from the text, and create a Node
instance for every line, pass this to the add_children
method of the root node.
root = Node('root')
root.add_children([Node(line) for line in indented_text.splitlines() if line.strip()])
d = root.as_dict()['root']
print(d)
result:
{'apple': [
{'colours': ['red', 'yellow', 'green']},
{'type': 'granny smith'},
{'price': '0.10'}]
}
I think that it should be possible to do it in one step, where you simply call the constructor of Node
once, with the indented text as an argument.
回答4:
First of all, don't use array
and dict
as variable names because they're reserved words in Python and reusing them may end up in all sorts of chaos.
OK so if I get you correctly, you have a tree given in a text file, with parenthood indicated by indentations, and you want to recover the actual tree structure. Right?
Does the following look like a valid outline? Because I have trouble putting your current code into context.
result = {}
last_indentation = 0
for l in f.xreadlines():
(c, i) = parse(l) # create parse to return character and indentation
if i==last_indentation:
# sibling to last
elif i>last_indentation:
# child to last
else:
# end of children, back to a higher level
OK then your list are the current parents, that's in fact right - but I'd keep them pointed to the dictionary you've created, not the literal letter
just starting some stuff here
result = {}
parents = {}
last_indentation = 1 # start with 1 so 0 is the root of tree
parents[0] = result
for l in f.xreadlines():
(c, i) = parse(l) # create parse to return character and indentation
if i==last_indentation:
new_el = {}
parents[i-1][c] = new_el
parents[i] = new_el
elif i>last_indentation:
# child to last
else:
# end of children, back to a higher level