pyparsing whitespace match issues

Posted 2020-07-11 05:18

Question:

I am trying to use pyparsing to parse Robot Framework, which is a text-based DSL. The syntax is roughly as follows (sorry, I find it hard to describe in BNF). A single line in Robot Framework may look like:

Library\tSSHClient    with name\tnode

\t is a tab, and Robot Framework transparently converts it to two spaces. (In fact, it just calls str.replace('\t', '  ') to replace the tab, but this changes the actual length of each line: len('\t') is 1 but len('  ') is 2.) In Robot Framework, two or more spaces, or a '\t', separate tokens. If there is only one space between words, the words are treated as a single token.

Library\tSSHClient    with name\tnode

is split into the following tokens when parsed correctly:

 ['Library', 'SSHClient', 'with name', 'node']

Because there is only one space between "with" and "name", the parser treats them as a single grouped token.
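As a reference point that does not involve pyparsing, the splitting rule described above can be sketched with the standard re module. This only illustrates the rule as stated in this question (a tab, or two or more spaces, separates tokens; a single space stays inside a token). The names SEPARATOR and split_cells are made up for the example and are not part of Robot Framework.

```python
import re

# Assumption: a separator is a tab or a run of two or more spaces,
# possibly several in a row, as described in the question.
SEPARATOR = re.compile(r"(?:\t| {2,})+")


def split_cells(line):
    """Split one line into tokens; single spaces stay inside a token."""
    return [cell for cell in SEPARATOR.split(line) if cell]


print(split_cells("Library\tSSHClient    with name\tnode"))
```

Note that this sketch returns only the token text, not the positions of the tokens in the line.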

Here is my code:

from pyparsing import *

ParserElement.setDefaultWhitespaceChars('\r\n\t ')
source = "Library\tSSHClient    with name\tnode"
EACH_LINE = Optional(Word(" ")).leaveWhitespace().suppress() + \
            CaselessKeyword("library").suppress() + \
            OneOrMore((Word(alphas)) + White(max=1).setResultsName('myValue')) + \
            SkipTo(LineEnd())

res = EACH_LINE.parseString(source)
print(res.myValue)

Questions:

1) I have already set the default whitespace characters. To match exactly two or more spaces OR one or more tabs, I thought the code would be White(ws=' ', min=2) | White(ws='\t', min=1), but this fails. Is it not possible to specify the whitespace characters this way?

2) Is there a way to get the index of each matched result? I tried setParseAction, but I could not work out how to get the index from that callback. I need both the start and end index to highlight the word.

3) What do LineStart and LineEnd mean? When I print them, they look like ordinary strings. Do I have to put them around each line, like LineStart() + balabala... + LineEnd()?

Thanks. However, there is a restriction: I cannot replace '\t' with '  ' (two spaces).

from pyparsing import *

source = "Library\tsshclient\t\t\twith name    s1"

# Here the whitespace seems to have been set to ' ' already, so why does the result still match '\t'?
value = Combine(OneOrMore(Word(printables) | White(' ', max=1) + ~White()))

linedefn = OneOrMore(value)

res = linedefn.parseString(source)

print(res)

I got

['Library sshclient', 'with name', 's1']

but I expected ['Library', 'sshclient', 'with name', 's1']

Answer 1:

I always flinch when whitespace creeps into parsed tokens, but with your constraints that only single spaces are allowed, this should be workable. I used the following expression to define your values that could have embedded single spaces:

# each value consists of printable words separated by at most a 
# single space (a space that is not followed by another space)
value = Combine(OneOrMore(Word(printables) | White(' ',max=1) + ~White()))

With this done, a line is just one or more of these values:

linedefn = OneOrMore(value)

Following your example, including calling str.replace to replace tabs with pairs of spaces, the code looks like:

data = "Library\tSSHClient    with name\tnode"

# replace tabs with 2 spaces
data = data.replace('\t', '  ')

print(linedefn.parseString(data))

Giving:

['Library', 'SSHClient', 'with name', 'node']

To get the start and end locations of any values in the original string, wrap the expression in the pyparsing helper locatedExpr (new at the time of writing; pyparsing 3 deprecates it in favor of the Located class):

# use new locatedExpr to get the value, start, and end location 
# for each value
linedefn = OneOrMore(locatedExpr(value))('values')

If we parse and dump the results:

print(linedefn.parseString(data).dump())

We get:

- values: 
  [0]:
    [0, 'Library', 7]
    - locn_end: 7
    - locn_start: 0
    - value: Library
  [1]:
    [9, 'SSHClient', 18]
    - locn_end: 18
    - locn_start: 9
    - value: SSHClient
  [2]:
    [22, 'with name', 31]
    - locn_end: 31
    - locn_start: 22
    - value: with name
  [3]:
    [33, 'node', 37]
    - locn_end: 37
    - locn_start: 33
    - value: node
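The named fields shown in the dump can also be read directly, which is what a highlighter would need. The following is a minimal sketch under the same assumptions as above (tabs already replaced by two spaces). The spans list is just an illustrative name, and the definitions are repeated so the snippet runs on its own.

```python
from pyparsing import Combine, OneOrMore, White, Word, locatedExpr, printables

# Same grammar as above: words joined by at most a single space.
value = Combine(OneOrMore(Word(printables) | White(' ', max=1) + ~White()))
linedefn = OneOrMore(locatedExpr(value))('values')

data = "Library\tSSHClient    with name\tnode".replace('\t', '  ')
result = linedefn.parseString(data)

# Collect (start, end, text) for each value; start/end index into `data`.
spans = [(v['locn_start'], v['locn_end'], v['value']) for v in result]
for start, end, text in spans:
    print(start, end, repr(data[start:end]))
```

Each slice data[start:end] should reproduce the corresponding value, so the pair can be handed straight to whatever does the highlighting.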

LineStart and LineEnd are pyparsing expression classes whose instances should match at the start and end of a line. LineStart has always been difficult to work with, but LineEnd is fairly predictable. In your case, if you just read and parse a line at a time, then you shouldn't need them - just define the contents of the line that you expect. If you want to ensure that the parser has processed the entire string (and not stopped short of the end because of a non-matching character), add + LineEnd() or + StringEnd() to the end of your parser, or add the argument parseAll=True to your call to parseString().
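To make the last point concrete, here is a small sketch of how parseAll=True changes the outcome. The words grammar and the sample string are made up for the illustration and are not part of the Robot Framework parser.

```python
from pyparsing import OneOrMore, ParseException, Word, alphas

# Hypothetical grammar: one or more purely alphabetic words.
words = OneOrMore(Word(alphas))
text = "abc def 123"

# Without parseAll, parsing stops quietly at the first non-matching text.
partial = words.parseString(text).asList()
print(partial)

# With parseAll=True, the unparsed remainder raises ParseException.
try:
    words.parseString(text, parseAll=True)
    rejected = False
except ParseException as err:
    rejected = True
    print("rejected:", err)
```

The first call returns only the leading words and gives no hint that "123" was left unparsed, which is why the stricter form is useful when a whole line must match.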

EDIT:

It is easy to forget that pyparsing calls str.expandtabs by default - you have to disable this by calling parseWithTabs. That, and explicitly disallowing TABs between value words resolves your problem, and keeps the values at the correct character counts. See changes below:
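Before the revised code, it may help to see what that default tab expansion does to the second sample string from the question. This sketch uses only the built-in str.expandtabs (default tab size 8) and does not call pyparsing:

```python
source = "Library\tsshclient\t\t\twith name    s1"

# Tabs are expanded to the next multiple of 8 columns, so the tab after
# the 7-character word "Library" becomes exactly one space.
expanded = source.expandtabs()
print(repr(expanded))
print(expanded.startswith("Library sshclient "))
```

Because "Library" and "sshclient" end up only one space apart, the single-space rule joins them, which is consistent with the ['Library sshclient', 'with name', 's1'] result reported in the question. The revised code follows.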

from pyparsing import *
TAB = White('\t')

# each value consists of printable words separated by at most a 
# single space (a space that is not followed by another space)
value = Combine(OneOrMore(~TAB + (Word(printables) | White(' ',max=1) + ~White())))

# each line has one or more of these values
linedefn = OneOrMore(value)
# do not expand tabs before parsing
linedefn.parseWithTabs()


data = "Library\tSSHClient    with name\tnode"

# replace tabs with 2 spaces
#data = data.replace('\t', '  ')

print(linedefn.parseString(data))


linedefn = OneOrMore(locatedExpr(value))('values')
# do not expand tabs before parsing
linedefn.parseWithTabs()
print(linedefn.parseString(data).dump())
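As a quick check that the offsets now refer to the original, unexpanded string, the located values can be sliced back out of data. This is a sketch that repeats the definitions above so that it runs on its own; the located list is an illustrative name.

```python
from pyparsing import Combine, OneOrMore, White, Word, locatedExpr, printables

TAB = White('\t')
value = Combine(OneOrMore(~TAB + (Word(printables) | White(' ', max=1) + ~White())))
linedefn = OneOrMore(locatedExpr(value))('values')
linedefn.parseWithTabs()  # keep tabs so offsets match the raw string

data = "Library\tSSHClient    with name\tnode"
located = [(v['locn_start'], v['locn_end'], v['value'])
           for v in linedefn.parseString(data)]

for start, end, text in located:
    # each slice of the raw string should equal the parsed value
    print(start, end, repr(data[start:end]), data[start:end] == text)
```

With the tabs preserved, each tab counts as a single character, so the offsets differ from the earlier dump, where every tab had been replaced by two spaces.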