I'm trying to parse line-oriented data and then group the lines into lists.
Here is my script:
from pyparsing import *
data = """START
line 2
line 3
line 4
END
START
line a
line b
line c
END
"""
EOL = LineEnd().suppress()
start = Keyword('START').suppress() + EOL
end = Keyword('END').suppress() + EOL
line = SkipTo(LineEnd()) + EOL
lines = start + OneOrMore(start | end | Group(line))
start.setDebug()
end.setDebug()
line.setDebug()
result = lines.parseString(data)
results_list = result.asList()
print(results_list)
This code was inspired by another Stack Overflow question, "Matching nonempty lines with pyparsing".
What I need is to parse everything from START to END line by line and save it to one list per group (everything from a START to its matching END is one group). However, this script puts every line in its own group.
This is the result:
[['line 2'], ['line 3'], ['line 4'], ['line a'], ['line b'], ['line c'], ['']]
And I want it to be:
[['line 2', 'line 3', 'line 4'], ['line a', 'line b', 'line c']]
It also parses an empty string at the end.
I'm a pyparsing beginner, so any help would be appreciated.
Thanks
You could use nestedExpr to find the text delimited by START and END.
If you use
In [322]: pp.nestedExpr('START', 'END').searchString(data).asList()
Out[322]:
[[['line', '2', 'line', '3', 'line', '4']],
[['line', 'a', 'line', 'b', 'line', 'c']]]
then the text is split on whitespace. (Notice we have 'line', '2' above where we want 'line 2' instead.) We'd rather it split only on '\n'. To fix this, we can use the content parameter of pp.nestedExpr, which lets us control what is considered an item inside the nested list.
The source code for nestedExpr defines
content = (Combine(OneOrMore(~ignoreExpr +
~Literal(opener) + ~Literal(closer) +
CharsNotIn(ParserElement.DEFAULT_WHITE_CHARS,exact=1))
).setParseAction(lambda t:t[0].strip()))
by default, where pp.ParserElement.DEFAULT_WHITE_CHARS is
In [324]: pp.ParserElement.DEFAULT_WHITE_CHARS
Out[324]: ' \n\t\r'
This is what causes nestedExpr to split on all whitespace. So if we reduce the excluded characters to just '\n', then nestedExpr splits the content by lines instead of by all whitespace.
import pyparsing as pp
data = """START
line 2
line 3
line 4
END
START
line a
line b
line c
END
"""
opener = 'START'
closer = 'END'
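# An item is a run of characters up to the next newline that does not
# begin with the opener or closer text.
content = pp.Combine(pp.OneOrMore(~pp.Literal(opener)
                                  + ~pp.Literal(closer)
                                  + pp.CharsNotIn('\n', exact=1)))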
expr = pp.nestedExpr(opener, closer, content=content)
result = [item[0] for item in expr.searchString(data).asList()]
print(result)
yields
[['line 2', 'line 3', 'line 4'], ['line a', 'line b', 'line c']]
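For comparison, here is a minimal sketch that stays closer to the grammar in the question instead of using nestedExpr. It is not part of the approach above, only one possible alternative: each block is written as start, then a Group of one or more lines, then end, and a negative lookahead (~end) stops the line expression from consuming the END line. The names block, blocks and grouped are made up for this illustration; everything else comes from the question's script.

```python
import pyparsing as pp

data = """START
line 2
line 3
line 4
END
START
line a
line b
line c
END
"""

EOL = pp.LineEnd().suppress()
start = pp.Keyword('START').suppress() + EOL
end = pp.Keyword('END').suppress() + EOL

# A content line is anything up to the end of the line,
# as long as the line is not the END marker.
line = ~end + pp.SkipTo(pp.LineEnd()) + EOL

# One Group per START ... END block (hypothetical names).
block = start + pp.Group(pp.OneOrMore(line)) + end
blocks = pp.OneOrMore(block)

grouped = blocks.parseString(data).asList()
print(grouped)
```

Because a line is only attempted inside a block that has already matched START, this version should also avoid the trailing empty-string match from the question. It assumes the blocks are not nested; for nested START/END pairs, nestedExpr is the better fit.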