Parse line data until keyword with pyparsing

Posted 2020-07-20 04:49

Question:

I'm trying to parse line data and then group the lines into lists.

Here is my script:

from pyparsing import *

data = """START
line 2
line 3
line 4
END
START
line a
line b
line c
END
"""

EOL = LineEnd().suppress()
start = Keyword('START').suppress() + EOL
end = Keyword('END').suppress() + EOL

line = SkipTo(LineEnd()) + EOL
lines = start + OneOrMore(start | end | Group(line))

start.setDebug()
end.setDebug()
line.setDebug()

result = lines.parseString(data)
results_list = result.asList()

print(results_list)

This code was inspired by another Stack Overflow question: Matching nonempty lines with pyparsing

What I need is to parse everything from START to END line by line and save it as one list per group (everything from START to the matching END is one group). However, this script puts every line in its own group.

This is the result:

[['line 2'], ['line 3'], ['line 4'], ['line a'], ['line b'], ['line c'], ['']]

And I want it to be:

[['line 2', 'line 3', 'line 4'], ['line a', 'line b', 'line c']]

It also parses an empty string at the end.

I'm a pyparsing beginner, so I'd appreciate your help.

Thanks

Answer 1:

You could use nestedExpr to find the text delimited by START and END.

If you use

In [322]: pp.nestedExpr('START', 'END').searchString(data).asList()
Out[322]: 
[[['line', '2', 'line', '3', 'line', '4']],
 [['line', 'a', 'line', 'b', 'line', 'c']]]

then the text is split on whitespace (notice that we get 'line', '2' above where we want 'line 2'). We'd rather split only on '\n'. To fix this, we can use nestedExpr's content parameter, which controls what is considered an item inside the nested list. The source code for nestedExpr defines

content = (Combine(OneOrMore(~ignoreExpr + 
                ~Literal(opener) + ~Literal(closer) +
                CharsNotIn(ParserElement.DEFAULT_WHITE_CHARS,exact=1))
            ).setParseAction(lambda t:t[0].strip()))

by default, where pp.ParserElement.DEFAULT_WHITE_CHARS is

In [324]: pp.ParserElement.DEFAULT_WHITE_CHARS
Out[324]: ' \n\t\r'

This is what causes nestedExpr to split on all whitespace. If we reduce that character set to just '\n', nestedExpr splits the content by lines instead of by all whitespace.


import pyparsing as pp

data = """START
line 2
line 3
line 4
END
START
line a
line b
line c
END
"""

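opener = 'START'
closer = 'END'
# Each content item is one line: consume characters one at a time,
# stopping at '\n' or wherever the opener or closer appears.
content = pp.Combine(pp.OneOrMore(~pp.Literal(opener)
                                  + ~pp.Literal(closer)
                                  + pp.CharsNotIn('\n', exact=1)))
expr = pp.nestedExpr(opener, closer, content=content)

# searchString wraps each match in an extra list, so take the first element.
result = [item[0] for item in expr.searchString(data).asList()]
print(result)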

yields

[['line 2', 'line 3', 'line 4'], ['line a', 'line b', 'line c']]
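
As an alternative that stays closer to the grammar in the question, here is a minimal sketch that is not part of the answer above. It wraps everything from START to END in a single Group and uses a negative lookahead (~end) so that the END line is never consumed as data. The name block is introduced here only for illustration, and the sketch assumes each block has at least one data line and that START and END sit on lines of their own.

```python
import pyparsing as pp

data = """START
line 2
line 3
line 4
END
START
line a
line b
line c
END
"""

EOL = pp.LineEnd().suppress()
start = pp.Keyword('START').suppress() + EOL
end = pp.Keyword('END').suppress() + EOL

# A data line is any line that is not the END marker line.
line = ~end + pp.SkipTo(pp.LineEnd()) + EOL

# Group the whole START..END block rather than each individual line.
block = pp.Group(start + pp.OneOrMore(line) + end)

result = pp.OneOrMore(block).parseString(data).asList()
print(result)
```

With the sample data this should print [['line 2', 'line 3', 'line 4'], ['line a', 'line b', 'line c']]. Because every block has to begin with START, nothing can match at the end of the input, which appears to be where the trailing [''] in the question came from: LineEnd also matches at the end of the string, so the bare line expression matched one last empty line.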