I am trying to use pyparsing to parse Robot Framework, which is a text-based DSL. The syntax is as follows (sorry, but I find it a little hard to describe in BNF).
A single line in Robot Framework may look like:
Library\tSSHClient    with name\tnode
Here \t is a tab, which Robot Framework transparently converts to 2 spaces. (In fact, it just calls str.replace('\t', '  ') to replace the tab, but this changes the actual length of each line: len('\t') is 1 but len('  ') is 2.)
In Robot Framework, 2 or more spaces, or a '\t', separate tokens; if there is only 1 space between words, the words are treated as a single token.
Library\tSSHClient    with name\tnode
is actually split into the following tokens if parsed correctly:
['Library', 'SSHClient', 'with name', 'node']
As there is only 1 space between "with" and "name", the parser treats them as a single token.
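For reference, the splitting rule can be sketched without pyparsing, using only the standard re module. This is an illustration of the rule as I understand it, not Robot Framework's actual implementation; the names CELL_SEPARATOR and split_cells are made up for this example, and it assumes a run of spaces and tabs counts as one separator.

```python
import re

# Assumed separator: a run of two or more spaces/tabs, or a single tab.
# A lone space is NOT a separator, so it stays inside the token.
CELL_SEPARATOR = re.compile(r"[ \t]{2,}|\t")


def split_cells(line):
    """Return (start, end, text) for each token, with offsets into the
    original, unmodified line (tabs are not replaced or expanded)."""
    cells = []
    pos = 0
    for sep in CELL_SEPARATOR.finditer(line):
        if sep.start() > pos:
            cells.append((pos, sep.start(), line[pos:sep.start()]))
        pos = sep.end()
    if pos < len(line):
        cells.append((pos, len(line), line[pos:]))
    return cells


line = "Library\tSSHClient    with name\tnode"
for start, end, text in split_cells(line):
    print(start, end, repr(text))
```

Because the tabs are left in place, the start and end values index directly into the original line, which is what I need for highlighting.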
Here is my code:
from pyparsing import *
ParserElement.setDefaultWhitespaceChars('\r\n\t ')
source = "Library\tSSHClient    with name\tnode"
EACH_LINE = Optional(Word(" ")).leaveWhitespace().suppress() + \
    CaselessKeyword("library").suppress() + \
    OneOrMore((Word(alphas)) + White(max=1).setResultsName('myValue')) + \
    SkipTo(LineEnd())
res = EACH_LINE.parseString(source)
print(res.myValue)
Questions:
1) I have already set the whitespace characters. If I want to match exactly 2 or more spaces OR one or more tabs, I thought the code would look like:
White(ws=' ', min=2)| White(ws='\t', min=1)
but this fails, so is it not possible to specify the whitespace characters this way?
2) Is there a way to get the index of a matched result? I tried setParseAction, but it seems I cannot get the index from this callback. I need both the start and end index to highlight the word.
3) What do LineStart and LineEnd mean? When I print these values they seem to be just normal strings. Do I have to write something at the front of a line, like:
LineStart() + balabala... + LineEnd() ?
Thanks. However, there is a restriction: I cannot replace '\t' with spaces.
from pyparsing import *
source = "Library\tsshclient\t\t\twith name  s1"
# the whitespace here is restricted to ' ', so why does the result still match across '\t'?
value = Combine(OneOrMore(Word(printables) | White(' ', max=1) + ~White()))
linedefn = OneOrMore(value)
res = linedefn.parseString(source)
print(res)
I got
['Library sshclient', 'with name', 's1']
but I expected
['Library', 'sshclient', 'with name', 's1']
I always flinch when whitespace creeps into parsed tokens, but with your constraints that only single spaces are allowed, this should be workable. I used the following expression to define your values that could have embedded single spaces:
# each value consists of printable words separated by at most a
# single space (a space that is not followed by another space)
value = Combine(OneOrMore(Word(printables) | White(' ',max=1) + ~White()))
With this done, a line is just one or more of these values:
linedefn = OneOrMore(value)
Following your example, including calling str.replace to replace tabs with pairs of spaces, the code looks like:
data = "Library\tSSHClient    with name\tnode"
# replace tabs with 2 spaces
data = data.replace('\t', '  ')
print(linedefn.parseString(data))
Giving:
['Library', 'SSHClient', 'with name', 'node']
To get the start and end locations of any values in the original string, wrap the expression in the new pyparsing helper method locatedExpr:
# use new locatedExpr to get the value, start, and end location
# for each value
linedefn = OneOrMore(locatedExpr(value))('values')
If we parse and dump the results:
print(linedefn.parseString(data).dump())
We get:
- values:
  [0]:
    [0, 'Library', 7]
    - locn_end: 7
    - locn_start: 0
    - value: Library
  [1]:
    [9, 'SSHClient', 18]
    - locn_end: 18
    - locn_start: 9
    - value: SSHClient
  [2]:
    [22, 'with name', 31]
    - locn_end: 31
    - locn_start: 22
    - value: with name
  [3]:
    [33, 'node', 37]
    - locn_end: 37
    - locn_start: 33
    - value: node
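Since you want these locations for highlighting, here is a minimal sketch of how the start/end pairs can be used. It is plain Python with no pyparsing calls; the helper name underline is made up for this example, and the line is my reconstruction of the tab-replaced input that is consistent with the offsets in the dump above. The end location is exclusive, so line[start:end] gives back exactly the matched value.

```python
def underline(line, spans):
    """Build a marker string with '^' under every character covered by a
    (start, end) span; end is exclusive, like a Python slice."""
    marks = [" "] * len(line)
    for start, end in spans:
        for i in range(start, end):
            marks[i] = "^"
    return "".join(marks)


# Assumed input: the example line after tabs were replaced by 2 spaces.
line = "Library  SSHClient    with name  node"
# (locn_start, locn_end) pairs copied from the dump output.
spans = [(0, 7), (9, 18), (22, 31), (33, 37)]

print(line)
print(underline(line, spans))
```

The same spans could be handed to an editor widget or terminal colouring routine instead of building a marker line.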
LineStart and LineEnd are pyparsing expression classes whose instances should match at the start and end of a line. LineStart has always been difficult to work with, but LineEnd is fairly predictable. In your case, if you just read and parse a line at a time, then you shouldn't need them - just define the contents of the line that you expect. If you want to ensure that the parser has processed the entire string (and not stopped short of the end because of a non-matching character), add + LineEnd() or + StringEnd() to the end of your parser, or add the argument parseAll=True to your call to parseString().
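To illustrate the difference parseAll makes, here is a small sketch. The grammar words and the sample text are hypothetical, chosen only so that the input ends in something the grammar cannot match; they are not part of your Robot Framework parser.

```python
import pyparsing as pp

# Hypothetical grammar for illustration: one or more purely alphabetic words.
words = pp.OneOrMore(pp.Word(pp.alphas))

text = "alpha beta 123"

# Without parseAll, parsing stops quietly at the first non-matching
# character and returns whatever matched up to that point.
partial = list(words.parseString(text))
print(partial)

# With parseAll=True, the unparsed "123" raises a ParseException instead.
try:
    words.parseString(text, parseAll=True)
    consumed_everything = True
except pp.ParseException as err:
    consumed_everything = False
    print("parse stopped before the end of the string:", err)
```

Appending + StringEnd() to the grammar should have the same effect as parseAll=True, at the cost of baking the check into the expression itself.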
EDIT:
It is easy to forget that pyparsing calls str.expandtabs by default - you have to disable this by calling parseWithTabs. That, and explicitly disallowing TABs between value words resolves your problem, and keeps the values at the correct character counts. See changes below:
from pyparsing import *
TAB = White('\t')
# each value consists of printable words separated by at most a
# single space (a space that is not followed by another space)
value = Combine(OneOrMore(~TAB + (Word(printables) | White(' ',max=1) + ~White())))
# each line has one or more of these values
linedefn = OneOrMore(value)
# do not expand tabs before parsing
linedefn.parseWithTabs()
data = "Library\tSSHClient    with name\tnode"
# replace tabs with 2 spaces (no longer needed)
#data = data.replace('\t', '  ')
print(linedefn.parseString(data))
linedefn = OneOrMore(locatedExpr(value))('values')
# do not expand tabs before parsing
linedefn.parseWithTabs()
print(linedefn.parseString(data).dump())
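To see why tab expansion matters for the reported locations, here is a small check using only the built-in str.expandtabs, with no pyparsing involved. The sample line is the same shape as your example; how far the offsets drift depends on where each tab falls relative to the 8-column tab stops.

```python
line = "Library\tSSHClient    with name\tnode"

# str.expandtabs pads each tab out to the next multiple of 8 columns.
expanded = line.expandtabs()

# The first tab sits at column 7, so it expands to a single space and
# nothing moves; the second tab sits at column 30 and becomes 2 spaces.
original_start = line.index("node")
expanded_start = expanded.index("node")

print(len(line), len(expanded))
print(original_start, expanded_start)
```

Any location reported against the expanded string after the second tab is off by one relative to your original text, which is why parseWithTabs is needed when the offsets are used for highlighting.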