Consider the following data structure:
[HEADER1]
{
    key value
    key value
    ...
    [HEADER2]
    {
        key value
        ...
    }
    key value
    [HEADER3]
    {
        key value
        [HEADER4]
        {
            key value
            ...
        }
    }
    key value
}
There are no indents in the raw data, but I added them here for clarity. The number of key-value pairs is unknown; '...' indicates there could be many more within each [HEADER] block. The number of [HEADER] blocks is also unknown.
Note that the structure is nested, so in this example header 2 and 3 are inside header 1 and header 4 is inside header 3.
There can be many more (nested) headers, but I kept the example short.
How do I go about parsing this into a nested dictionary structure? Each [HEADER] should be the key to whatever follows inside the curly brackets.
The final result should be something like:
dict = {'HEADER1': 'contents of 1'}
contents of 1 = {'key': 'value', 'key': 'value', 'HEADER2': 'contents of 2', etc}
I'm guessing I need some sort of recursive function, but I am pretty new to Python and have no idea where to start.
For starters, I can pull out all the [HEADER] keys as follows:
path = 'mydatafile.txt'
keys = []
with open(path, 'rt') as file:
    for line in file:
        if line.startswith('['):
            keys.append(line.rstrip('\n'))
for key in keys:
    print(key)
But then what? Maybe this is not even needed?
Any suggestions?
You can do it by pre-formatting your file content with a few regex substitutions and then passing the result to json.loads.
You can apply these regex substitutions one by one:
#1
\[(\w*)\]\n
->"$1":
#2
\}\n(\w)
->},$1
#3
(\w*)\s(\w*)\n([^}])
->$1:$2,$3
#4
(\w*)\s(\w*)\n\}
->$1:$2}
Finally, pass the resulting string to json.loads, which will parse it into a dict.
Explanation:
1. \[(\w*)\]\n : replace [HEADERS]\n with "HEADERS":
2. \}\n(\w) : replace any closing brace } that has a value after it with },
3. (\w*)\s(\w*)\n([^}]) : replace key value\n with key:value, for lines that have a following element
4. (\w*)\s(\w*)\n\} : replace key value\n with key:value for lines that have no following element
So, with minor modifications to these regexes you will be able to parse it into a dict. In particular, json.loads requires keys and string values to be double-quoted and the whole text to be wrapped in {}, so the replacements have to produce that. The basic concept is to reformat the file contents into a format that can be parsed easily.
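A minimal Python sketch of the same idea, offered as an illustration rather than a drop-in solution. The patterns are an adaptation, not the four substitutions listed above: they quote keys and values and use a lookahead to place the commas. The function name to_dict and the sample text are hypothetical, and the sketch assumes every data line is exactly "key value", header names contain only word characters, and values contain no double quotes or backslashes.

```python
import json
import re


def to_dict(raw):
    """Rewrite the bracket/brace format as JSON text and parse it (sketch)."""
    text = raw.strip()
    # [HEADER] on its own line -> "HEADER":
    text = re.sub(r'^\[(\w+)\]$', r'"\1":', text, flags=re.M)
    # key value -> "key": "value" (the value is everything after the first gap)
    text = re.sub(r'^(\w+)[ \t]+(.*)$', r'"\1": "\2"', text, flags=re.M)
    # Add a comma after a pair or a closing brace when another item follows,
    # i.e. when the next line starts with a double quote.
    text = re.sub(r'(["}])\n(?=")', r'\1,\n', text)
    # Wrap everything in braces so the top-level headers form one JSON object.
    return json.loads('{' + text + '}')


sample = "[HEADER1]\n{\na 1\n[HEADER2]\n{\nb 2\n}\nc 3\n}\n"
print(to_dict(sample))
```

All values come back as strings; converting them to numbers, or handling repeated keys inside one block, would need extra steps.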
I would convert [HEADER1] to "header": "HEADER1" and key value to "key": "value", enclose everything in {}, and finally parse it with the json library. You can do the conversion with sed.
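For comparison, the nesting can also be handled without converting to JSON first, which is closer to the recursive function the question guesses at. A minimal sketch, assuming each header sits on its own line, each brace sits on its own line, and each data line is "key value"; the names parse_block and parse_text are hypothetical.

```python
def parse_block(lines):
    """Read lines from an iterator up to the matching '}' and return a dict.

    `lines` must be an iterator (not a list) so that nested calls carry on
    from where the caller stopped.
    """
    result = {}
    for line in lines:
        line = line.strip()
        if not line or line == '{':
            continue  # skip blank lines and the brace that opens a block
        if line == '}':
            break  # end of the current block
        if line.startswith('[') and line.endswith(']'):
            # Header: the recursive call consumes its own '{' ... '}' lines.
            result[line[1:-1]] = parse_block(lines)
        else:
            key, _, value = line.partition(' ')
            result[key] = value.strip()
    return result


def parse_text(text):
    return parse_block(iter(text.splitlines()))


sample = "[HEADER1]\n{\na 1\n[HEADER2]\n{\nb 2\n}\nc 3\n}\n"
print(parse_text(sample))
```

Because an open file is already an iterator over its lines, parse_block(file) should work inside a with open(...) block as well. If a key repeats within one block, the later value overwrites the earlier one.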