Processing a repeatedly structured text file with Python

Posted 2019-01-18 16:10

Question:

I have a big text file structured in blocks like:

    Student = {
        PInfo = {
            ID   = 0001;
            Name.First = "Joe";
            Name.Last = "Burger";
            DOB  = "01/01/2000";
        };
        School = "West High";
        Address = {
            Str1 = "001 Main St.";
            Zip = 12345;
        };
    };
    Student = {
        PInfo = {
            ID   = 0002;
            Name.First = "John";
            Name.Last = "Smith";
            DOB  = "02/02/2002";
        };
        School = "East High";
        Address = {
            Str1 = "001 40nd St.";
            Zip = 12346;
        };
        Club = "Football";
    };
    ....

The Student blocks share the same entries, such as "PInfo", "School" and "Address", but some of them may have additional entries, such as the "Club" entry for "John Smith", which "Joe Burger" does not have. What I want to do is get the name, school name and zip code of each student and store them in a dictionary, like

    {'Joe Burger': {'School': 'West High', 'Zip': 12345}, 'John Smith': {'School': 'East High', 'Zip': 12346}, ...}

Being new to Python programming, I tried to open the file and analyze it line by line, but that turned out to be cumbersome. The real file is also quite large and more complicated than the example I posted above. I am wondering if there is an easier way to do it. Thanks in advance.

Answer 1:

To parse the file you could define a grammar that describes your input format and use it to generate a parser.

There are many language parsers in Python. For example, you could use Grako, which takes grammars in a variation of EBNF as input and outputs memoizing PEG parsers in Python.

To install Grako, run pip install grako.

Here's a grammar for your format using Grako's flavor of EBNF syntax:

(* a file is zero or more records *)
file = { record }* $;
record = name '=' value ';' ;
name = /[A-Z][a-zA-Z0-9.]*/ ;
value = object | integer | string ;
(* an object contains one or more records *)
object = '{' { record }+ '}' ;
integer = /[0-9]+/ ;
string = '"' /[^"]*/ '"';

To generate the parser, save the grammar to a file, e.g. Structured.ebnf, and run:

$ grako -o structured_parser.py Structured.ebnf

This creates the structured_parser module, which can be used to extract the student information from the input:

#!/usr/bin/env python
from pprint import pprint

from structured_parser import StructuredParser

class Semantics(object):
    """Turn the raw parse results into Python values, one method per rule."""
    def record(self, ast):
        # record = name '=' value ';' ;
        # value = object | integer | string ;
        return ast[0], ast[2] # name, value
    def object(self, ast):
        # object = '{' { record }+ '}' ;
        return dict(ast[1])
    def integer(self, ast):
        # integer = /[0-9]+/ ;
        return int(ast)
    def string(self, ast):
        # string = '"' /[^"]*/ '"';
        return ast[1]

with open('input.txt') as f:  # `f` avoids shadowing the built-in name `file`
    text = f.read()
parser = StructuredParser()
ast = parser.parse(text, rule_name='file', semantics=Semantics())
# keep only the top-level Student records
students = [value for name, value in ast if name == 'Student']
# map "First Last" to that student's school and zip code
d = {'{0[Name.First]} {0[Name.Last]}'.format(s['PInfo']):
     dict(School=s['School'], Zip=s['Address']['Zip'])
     for s in students}
pprint(d)

Output

{'Joe Burger': {'School': u'West High', 'Zip': 12345},
 'John Smith': {'School': u'East High', 'Zip': 12346}}
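
If installing a parser generator is not an option, the same grammar can also be followed by hand. Below is a minimal sketch of a recursive-descent parser in plain Python, one function per grammar rule. It is not generated by Grako and not part of it: TOKEN, tokenize, expect, parse_value, parse_records and SAMPLE are names made up for this illustration. It assumes strings never contain a double quote, and unlike the grammar it also accepts empty objects.

```python
import re

# One alternative per terminal of the grammar (hypothetical helper names).
TOKEN = re.compile(r'''\s*(?:
      (?P<name>[A-Z][a-zA-Z0-9.]*)
    | (?P<integer>[0-9]+)
    | "(?P<string>[^"]*)"
    | (?P<punct>[={};])
)''', re.VERBOSE)


def tokenize(text):
    """Yield (kind, text) pairs; kind is 'name', 'integer', 'string' or 'punct'."""
    text = text.rstrip()
    pos = 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError('unexpected input at offset %d' % pos)
        yield match.lastgroup, match.group(match.lastgroup)
        pos = match.end()


def expect(tokens, punct):
    """Consume one token and check that it is the given punctuation."""
    if next(tokens, None) != ('punct', punct):
        raise ValueError('expected %r' % punct)


def parse_value(tokens):
    """value = object | integer | string"""
    kind, text = next(tokens, (None, None))
    if kind == 'integer':
        return int(text)
    if kind == 'string':
        return text
    if (kind, text) == ('punct', '{'):
        # Nested objects become dicts, so repeated names inside one would collapse.
        return dict(parse_records(tokens, closing='}'))
    raise ValueError('expected a value, got %r' % (text,))


def parse_records(tokens, closing=None):
    """Parse records up to `closing` (or end of input) into (name, value) pairs."""
    records = []
    for kind, text in tokens:
        if (kind, text) == ('punct', closing):
            return records
        if kind != 'name':
            raise ValueError('expected a name, got %r' % (text,))
        expect(tokens, '=')
        records.append((text, parse_value(tokens)))
        expect(tokens, ';')
    if closing is not None:
        raise ValueError('missing %r' % closing)
    return records


SAMPLE = '''
Student = {
    PInfo = { ID = 0001; Name.First = "Joe"; Name.Last = "Burger"; };
    School = "West High";
    Address = { Str1 = "001 Main St."; Zip = 12345; };
};
Student = {
    PInfo = { ID = 0002; Name.First = "John"; Name.Last = "Smith"; };
    School = "East High";
    Address = { Str1 = "001 40nd St."; Zip = 12346; };
    Club = "Football";
};
'''

# The top level stays a list of pairs because "Student" repeats.
records = parse_records(tokenize(SAMPLE))
students = [value for name, value in records if name == 'Student']
summary = {
    '{} {}'.format(s['PInfo']['Name.First'], s['PInfo']['Name.Last']):
        {'School': s['School'], 'Zip': s['Address']['Zip']}
    for s in students
}
print(summary)
```

The tokenizer is a generator, so the nested calls all consume the same token stream, which keeps memory use modest for a single pass. For a very large file a generated parser may still be the more maintainable choice.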


Answer 2:

It's not JSON, but it is structured similarly, so you should be able to reformat it into JSON:

  1. "=" -> ":"
  2. quote all keys with '"'
  3. ";" -> ","
  4. remove all "," which are followed by a "}"
  5. put it in curly braces
  6. parse it with json.loads
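
The steps above can be sketched with a few regular-expression substitutions. This is only a rough illustration under stated assumptions, not a robust converter: it assumes string values never contain "=", ";" or ":", and the names to_json, extract and SAMPLE are made up for the example. It also adds two things the list does not mention. Integers with leading zeros such as 0001 are not valid JSON, so the zeros are stripped. And since "Student" repeats as a key, json.loads is called with object_pairs_hook=list to keep every block; a plain dict would keep only the last one.

```python
import json
import re


def to_json(text):
    """Rewrite the block format as JSON text (steps 1 to 5 of the list above)."""
    # Steps 1 and 2: quote each key and turn "=" into ":".
    text = re.sub(r'([A-Za-z_][\w.]*)\s*=', r'"\1":', text)
    # Extra step (assumption): JSON forbids leading zeros, so 0001 becomes 1.
    text = re.sub(r'(:\s*)0+(?=\d)', r'\1', text)
    # Step 3: ";" -> ",".
    text = text.replace(';', ',')
    # Step 4: drop commas that are followed by "}", plus the final trailing one.
    text = re.sub(r',(\s*)}', r'\1}', text)
    text = text.rstrip().rstrip(',')
    # Step 5: wrap everything in curly braces.
    return '{' + text + '}'


def extract(text):
    """Return {'First Last': {'School': ..., 'Zip': ...}} for every Student block."""
    # Step 6: every JSON object arrives as a list of (key, value) pairs,
    # so repeated "Student" keys are all preserved.
    top = json.loads(to_json(text), object_pairs_hook=list)
    result = {}
    for key, value in top:
        if key != 'Student':
            continue
        student = dict(value)
        info = dict(student['PInfo'])
        name = '{} {}'.format(info['Name.First'], info['Name.Last'])
        result[name] = {'School': student['School'],
                        'Zip': dict(student['Address'])['Zip']}
    return result


SAMPLE = '''
Student = {
    PInfo = {
        ID   = 0001;
        Name.First = "Joe";
        Name.Last = "Burger";
        DOB  = "01/01/2000";
    };
    School = "West High";
    Address = {
        Str1 = "001 Main St.";
        Zip = 12345;
    };
};
Student = {
    PInfo = {
        ID   = 0002;
        Name.First = "John";
        Name.Last = "Smith";
        DOB  = "02/02/2002";
    };
    School = "East High";
    Address = {
        Str1 = "001 40nd St.";
        Zip = 12346;
    };
    Club = "Football";
};
'''

print(extract(SAMPLE))
```

Because the substitutions run over the whole text without looking at quoting, a value such as "a = b" would be corrupted; if the real file contains such strings, a proper parser is the safer route.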


Answer 3:

For this kind of thing, I use Marpa::R2, a Perl interface to Marpa, a general BNF parser. It lets you describe the text as a set of grammar rules and parse it into a tree of arrays (a parse tree). You can then traverse the tree to save the results as a hash of hashes (a hash is Perl's equivalent of Python's dictionary) or use the tree as is.

I cooked up a working example using your input: parser, result tree.

Hope this helps.

P.S. Example of ast_traverse(): Parse values from a block of text based on specific keys