Access parsed elements using Pyparsing

Posted 2019-04-02 00:12

I have a bunch of sentences that I need to parse and convert to corresponding regex search code. Example sentences:

LINE_CONTAINS phrase one BEFORE {phrase2 AND phrase3} AND LINE_STARTSWITH Therefore we

- This means that, in the line, "phrase one" comes somewhere before phrase2 and phrase3. Also, the line must start with "Therefore we".

LINE_CONTAINS abc {upto 4 words} xyz {upto 3 words} pqr

- This means I need to allow up to 4 words between the first two phrases and up to 3 words between the last two phrases.

Using help from Paul McGuire, the following grammar was written:

from pyparsing import (CaselessKeyword, Word, alphanums, nums, MatchFirst, quotedString, 
    infixNotation, Combine, opAssoc, Suppress, pyparsing_common, Group, OneOrMore, ZeroOrMore)

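# LINE_ENDSWITH is used in the keyword list and in line_term below, so it must be defined here too
LINE_CONTAINS, LINE_STARTSWITH, LINE_ENDSWITH = map(CaselessKeyword,
    """LINE_CONTAINS LINE_STARTSWITH LINE_ENDSWITH""".split())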

NOT, AND, OR = map(CaselessKeyword, "NOT AND OR".split())
BEFORE, AFTER, JOIN = map(CaselessKeyword, "BEFORE AFTER JOIN".split())

lpar=Suppress('{') 
rpar=Suppress('}')

keyword = MatchFirst([LINE_CONTAINS, LINE_STARTSWITH, LINE_ENDSWITH, NOT, AND, OR, 
                      BEFORE, AFTER, JOIN]) # declaring all keywords and assigning order for all further use

phrase_word = ~keyword + (Word(alphanums + '_'))

upto_N_words = Group(lpar + 'upto' + pyparsing_common.integer('numberofwords') + 'words' + rpar)

phrase_term = Group(OneOrMore(phrase_word) + ZeroOrMore((upto_N_words) + OneOrMore(phrase_word)))



phrase_expr = infixNotation(phrase_term,
                            [
                             ((BEFORE | AFTER | JOIN), 2, opAssoc.LEFT,), # (opExpr, numTerms, rightLeftAssoc, parseAction)
                             (NOT, 1, opAssoc.RIGHT,),
                             (AND, 2, opAssoc.LEFT,),
                             (OR, 2, opAssoc.LEFT),
                            ],
                            lpar=Suppress('{'), rpar=Suppress('}')
                            ) # structure of a single phrase with its operators

line_term = Group((LINE_CONTAINS | LINE_STARTSWITH | LINE_ENDSWITH)("line_directive") + 
                  Group(phrase_expr)("phrase")) # basically giving structure to a single sub-rule having line-term and phrase
line_contents_expr = infixNotation(line_term,
                                   [(NOT, 1, opAssoc.RIGHT,),
                                    (AND, 2, opAssoc.LEFT,),
                                    (OR, 2, opAssoc.LEFT),
                                    ]
                                   ) # grammar for the entire rule/sentence

sample1 = """
LINE_CONTAINS phrase one BEFORE {phrase2 AND phrase3} AND LINE_STARTSWITH Therefore we
"""
sample2 = """
LINE_CONTAINS abcd {upto 4 words} xyzw {upto 3 words} pqrs BEFORE something else
"""

My question now is: how do I access the parsed elements in order to convert the sentences to my regex code? For this, I tried the following (once with sample1, once with sample2):

import pprint

parsed = line_contents_expr.parseString(sample1)  # then run again with sample2
print(parsed[0].asDict())
print(parsed)
pprint.pprint(parsed)

The result of the above code for sample1 was -

{}

[[['LINE_CONTAINS', [[['sentence', 'one'], 'BEFORE', [['sentence2'], 'AND', ['sentence3']]]]], 'AND', ['LINE_STARTSWITH', [['Therefore', 'we']]]]]

([([(['LINE_CONTAINS', ([([(['sentence', 'one'], {}), 'BEFORE', ([(['sentence2'], {}), 'AND', (['sentence3'], {})], {})], {})], {})], {'phrase': [(([([(['sentence', 'one'], {}), 'BEFORE', ([(['sentence2'], {}), 'AND', (['sentence3'], {})], {})], {})], {}), 1)], 'line_directive': [('LINE_CONTAINS', 0)]}), 'AND', (['LINE_STARTSWITH', ([(['Therefore', 'we'], {})], {})], {'phrase': [(([(['Therefore', 'we'], {})], {}), 1)], 'line_directive': [('LINE_STARTSWITH', 0)]})], {})], {})

The result of the above code for sample2 was -

{'phrase': [[['abcd', {'numberofwords': 4}, 'xyzw', {'numberofwords': 3}, 'pqrs'], 'BEFORE', ['something', 'else']]], 'line_directive': 'LINE_CONTAINS'}

[['LINE_CONTAINS', [[['abcd', ['upto', 4, 'words'], 'xyzw', ['upto', 3, 'words'], 'pqrs'], 'BEFORE', ['something', 'else']]]]]

([(['LINE_CONTAINS', ([([(['abcd', (['upto', 4, 'words'], {'numberofwords': [(4, 1)]}), 'xyzw', (['upto', 3, 'words'], {'numberofwords': [(3, 1)]}), 'pqrs'], {}), 'BEFORE', (['something', 'else'], {})], {})], {})], {'phrase': [(([([(['abcd', (['upto', 4, 'words'], {'numberofwords': [(4, 1)]}), 'xyzw', (['upto', 3, 'words'], {'numberofwords': [(3, 1)]}), 'pqrs'], {}), 'BEFORE', (['something', 'else'], {})], {})], {}), 1)], 'line_directive': [('LINE_CONTAINS', 0)]})], {})

My questions based on the above output are -

  1. Why does the pprint (pretty print) have more detailed parsing than normal print?
  2. Why does the asDict() method give no output for sample1 but does for sample2?
  3. Whenever I try to access the parsed elements using print (parsed.numberofwords) or parsed.line_directive or parsed.line_term, it gives me nothing. How can I access these elements in order to use them to build my regex codes?

1 answer
在下西门庆
#2 · 2019-04-02 00:31

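To answer your printing questions:

1) The pprint() method of the parsed results is there to pretty-print a nested list of tokens, without showing any results names. It is essentially a wrapper around pprint.pprint(results.asList()).

2) asDict() converts your parsed results to an actual Python dict, so it shows only the results names (with nesting if you have names within names). For sample1, parsed[0] is the group built for the top-level AND, which carries no names of its own; the named results sit one level deeper, in parsed[0][0] and parsed[0][2]. That is why parsed[0].asDict() is empty for sample1 but not for sample2.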

To view the contents of your parsed output, you are best off using print(result.dump()). dump() shows both the nesting of the results and any named results along the way.

result = line_contents_expr.parseString(sample2)
print(result.dump())
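
To make the difference between these views concrete, here is a minimal sketch on a deliberately tiny grammar. The grammar and the names key, value, first and second are made up for this illustration and are not part of the question's grammar. It shows that a results name defined inside a Group is reached by following the nesting, and that asking for it at the top level quietly returns an empty string, which is likely why parsed.line_directive appeared to give nothing.

```python
from pyparsing import Group, Word, alphas, nums

# Hypothetical mini-grammar (not the grammar from the question):
# two named groups, each holding a named key and a named value.
pair = Group(Word(alphas)("key") + Word(nums)("value"))
parser = pair("first") + pair("second")

result = parser.parseString("abc 1 def 2")

nested_list = result.asList()   # tokens only, results names dropped
nested_dict = result.asDict()   # results names only, nested by group
first_key = result.first.key    # follow the nesting to reach an inner name
missing = result.key            # "key" is not defined at the top level

print(nested_list)
print(nested_dict)
print(repr(first_key), repr(missing))
print(result.dump())            # tokens and names together
```

Applied to the grammar in the question, the same idea would mean indexing down to the line term first (for example parsed[0].line_directive for sample2) rather than reading the name off the top-level result.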

I also recommend using expr.runTests to give you dump() output as well as any exceptions and exception locators. With your code, you could most easily do this using:

line_contents_expr.runTests([sample1, sample2])

But I also suggest you step back a second and think about just what this {upto n words} business is all about. Look at your samples and draw rectangles around the line terms, and then within the line terms draw circles around the phrase terms. (This would be a good exercise in leading up to writing for yourself a BNF description of this grammar, which I always recommend as a getting-your-head-around-the-problem step.) What if you treated the upto expressions as another operator? To see this, change phrase_term back to the way you had it:

phrase_term = Group(OneOrMore(phrase_word))

And then change your first precedence entry in defining a phrase expression to:

    ((BEFORE | AFTER | JOIN | upto_N_words), 2, opAssoc.LEFT,),

Or give some thought to maybe having upto operator at a higher or lower precedence than BEFORE, AFTER, and JOIN, and adjust the precedence list accordingly.

With this change, I get this output from calling runTests on your samples:

LINE_CONTAINS phrase one BEFORE {phrase2 AND phrase3} AND LINE_STARTSWITH Therefore we

[[['LINE_CONTAINS', [[['phrase', 'one'], 'BEFORE', [['phrase2'], 'AND', ['phrase3']]]]], 'AND', ['LINE_STARTSWITH', [['Therefore', 'we']]]]]
[0]:
  [['LINE_CONTAINS', [[['phrase', 'one'], 'BEFORE', [['phrase2'], 'AND', ['phrase3']]]]], 'AND', ['LINE_STARTSWITH', [['Therefore', 'we']]]]
  [0]:
    ['LINE_CONTAINS', [[['phrase', 'one'], 'BEFORE', [['phrase2'], 'AND', ['phrase3']]]]]
    - line_directive: 'LINE_CONTAINS'
    - phrase: [[['phrase', 'one'], 'BEFORE', [['phrase2'], 'AND', ['phrase3']]]]
      [0]:
        [['phrase', 'one'], 'BEFORE', [['phrase2'], 'AND', ['phrase3']]]
        [0]:
          ['phrase', 'one']
        [1]:
          BEFORE
        [2]:
          [['phrase2'], 'AND', ['phrase3']]
          [0]:
            ['phrase2']
          [1]:
            AND
          [2]:
            ['phrase3']
  [1]:
    AND
  [2]:
    ['LINE_STARTSWITH', [['Therefore', 'we']]]
    - line_directive: 'LINE_STARTSWITH'
    - phrase: [['Therefore', 'we']]
      [0]:
        ['Therefore', 'we']



LINE_CONTAINS abcd {upto 4 words} xyzw {upto 3 words} pqrs BEFORE something else

[['LINE_CONTAINS', [[['abcd'], ['upto', 4, 'words'], ['xyzw'], ['upto', 3, 'words'], ['pqrs'], 'BEFORE', ['something', 'else']]]]]
[0]:
  ['LINE_CONTAINS', [[['abcd'], ['upto', 4, 'words'], ['xyzw'], ['upto', 3, 'words'], ['pqrs'], 'BEFORE', ['something', 'else']]]]
  - line_directive: 'LINE_CONTAINS'
  - phrase: [[['abcd'], ['upto', 4, 'words'], ['xyzw'], ['upto', 3, 'words'], ['pqrs'], 'BEFORE', ['something', 'else']]]
    [0]:
      [['abcd'], ['upto', 4, 'words'], ['xyzw'], ['upto', 3, 'words'], ['pqrs'], 'BEFORE', ['something', 'else']]
      [0]:
        ['abcd']
      [1]:
        ['upto', 4, 'words']
        - numberofwords: 4
      [2]:
        ['xyzw']
      [3]:
        ['upto', 3, 'words']
        - numberofwords: 3
      [4]:
        ['pqrs']
      [5]:
        BEFORE
      [6]:
        ['something', 'else']

You can iterate over these results and pick them apart, but you are rapidly reaching the point where you should look at building executable nodes from the different precedence levels - see the SimpleBool.py example on the pyparsing wiki for how to do this.
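
As a rough illustration of picking the results apart by hand, the sketch below walks the flat phrase list shown above for sample2, copied here as a plain Python list so that it runs without pyparsing. The helper name phrase_to_regex and the choice of ".*" for BEFORE are assumptions made for this example only; it does not handle nested groups, AND, OR, NOT, AFTER or JOIN.

```python
import re

# Nested list copied from the sample2 output shown above.
phrase = [['abcd'], ['upto', 4, 'words'], ['xyzw'], ['upto', 3, 'words'], ['pqrs'],
          'BEFORE', ['something', 'else']]

def phrase_to_regex(items):
    """Hypothetical helper: turn one flat phrase expression into a regex string."""
    parts = []
    for item in items:
        if item == 'BEFORE':
            # assumption: "A BEFORE B" means A, then anything, then B
            parts.append('.*')
        elif item[0] == 'upto':
            # ['upto', n, 'words'] allows 0..n extra words plus the separating space
            parts.append(r'(?:\s+\S+){0,%d}\s+' % item[1])
        else:
            # a plain phrase: its words separated by whitespace
            parts.append(r'\s+'.join(re.escape(w) for w in item))
    return ''.join(parts)

pattern = phrase_to_regex(phrase)
hit = re.search(pattern, "abcd one two xyzw pqrs and then something else")
miss = re.search(pattern, "abcd 1 2 3 4 5 xyzw pqrs something else")
print(pattern)
print(bool(hit), bool(miss))
```

This kind of hand-written walker grows awkward quickly once operators nest, which is the motivation for the node classes suggested next.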

EDIT: Please review this pared-down version of a parser for phrase_expr, and how it creates Node instances that themselves generate the output. See how numberofwords is accessed on the operator in the UpToNode class. See how "xyz abc" gets interpreted as "xyz AND abc" with an implicit AND operator.

from pyparsing import *
import re

UPTO, WORDS, AND, OR = map(CaselessKeyword, "upto words and or".split())
keyword = UPTO | WORDS | AND | OR
LBRACE,RBRACE = map(Suppress, "{}")
integer = pyparsing_common.integer()

word = ~keyword + Word(alphas)
upto_expr = Group(LBRACE + UPTO + integer("numberofwords") + WORDS + RBRACE)

class Node(object):
    def __init__(self, tokens):
        self.tokens = tokens

    def generate(self):
        pass

class LiteralNode(Node):
    def generate(self):
        return "(%s)" % re.escape(self.tokens[0])
    def __repr__(self):
        return repr(self.tokens[0])

class AndNode(Node):
    def generate(self):
        tokens = self.tokens[0]
        return '.*'.join(t.generate() for t in tokens[::2])

    def __repr__(self):
        return ' AND '.join(repr(t) for t in self.tokens[0].asList()[::2])

class OrNode(Node):
    def generate(self):
        tokens = self.tokens[0]
        return '|'.join(t.generate() for t in tokens[::2])

    def __repr__(self):
        return ' OR '.join(repr(t) for t in self.tokens[0].asList()[::2])

class UpToNode(Node):
    def generate(self):
        tokens = self.tokens[0]
        ret = tokens[0].generate()
        word_re = r"\s+\S+"
        space_re = r"\s+"
        for op, operand in zip(tokens[1::2], tokens[2::2]):
            # op contains the parsed "upto" expression
            ret += "((%s){0,%d}%s)" % (word_re, op.numberofwords, space_re) + operand.generate()
        return ret

    def __repr__(self):
        tokens = self.tokens[0]
        ret = repr(tokens[0])
        for op, operand in zip(tokens[1::2], tokens[2::2]):
            # op contains the parsed "upto" expression
            ret += " {0-%d WORDS} " % (op.numberofwords) + repr(operand)
        return ret

IMPLICIT_AND = Empty().setParseAction(replaceWith("AND"))

phrase_expr = infixNotation(word.setParseAction(LiteralNode),
        [
        (upto_expr, 2, opAssoc.LEFT, UpToNode),
        (AND | IMPLICIT_AND, 2, opAssoc.LEFT, AndNode),
        (OR, 2, opAssoc.LEFT, OrNode),
        ])

tests = """\
        xyz
        xyz abc
        xyz {upto 4 words} def""".splitlines()

for t in tests:
    t = t.strip()
    if not t:
        continue
    print(t)
    try:
        parsed = phrase_expr.parseString(t)
    except ParseException as pe:
        print(' '*pe.loc + '^')
        print(pe)
        continue
    print(parsed)
    print(parsed[0].generate())
    print()

prints:

xyz
['xyz']
(xyz)

xyz abc
['xyz' AND 'abc']
(xyz).*(abc)

xyz {upto 4 words} def
['xyz' {0-4 WORDS} 'def']
(xyz)((\s+\S+){0,4}\s+)(def)
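
To see what the last generated pattern accepts, it can be tried against a few lines with the standard re module. The test lines below are made up for this check; the pattern is copied verbatim from the output above.

```python
import re

# Pattern copied verbatim from the generated output above.
pattern = re.compile(r"(xyz)((\s+\S+){0,4}\s+)(def)")

# Made-up lines: 0, 2 and 5 intervening words, and one with no space at all.
lines = ["xyz def", "xyz one two def", "xyz 1 2 3 4 5 def", "xyzdef"]
matches = {line: bool(pattern.search(line)) for line in lines}

for line, ok in matches.items():
    print("%-20s %s" % (line, ok))
```

Note that the pattern is not anchored on word boundaries, so a line such as "abcxyz def" would also match; adding \b around the literals is one possible refinement.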

Expand on this to support your LINE_xxx expressions.
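
One possible direction for the LINE_xxx part, sketched under assumptions: each line term produces its own regex, anchored according to its directive, and the AND/OR between line terms is evaluated in Python rather than folded into a single pattern. The helper line_regex and the two phrase patterns are hypothetical and written by hand here; in a full solution they would come from the generate() methods of the node classes.

```python
import re

def line_regex(directive, phrase_pattern):
    """Hypothetical helper: anchor a phrase pattern according to its line directive."""
    if directive == "LINE_STARTSWITH":
        return r"^\s*" + phrase_pattern
    if directive == "LINE_ENDSWITH":
        return phrase_pattern + r"\s*$"
    if directive == "LINE_CONTAINS":
        return phrase_pattern
    raise ValueError("unknown directive: %r" % directive)

# Hand-written phrase patterns standing in for generated ones.
starts = re.compile(line_regex("LINE_STARTSWITH", r"Therefore\s+we"))
contains = re.compile(line_regex("LINE_CONTAINS", r"phrase\s+one.*phrase2"))

line = "Therefore we put phrase one ahead of phrase2 and phrase3"
other = "We conclude: Therefore we stop"

# "LINE_CONTAINS ... AND LINE_STARTSWITH ..." becomes a Python "and" of two searches.
both = bool(contains.search(line)) and bool(starts.search(line))
other_starts = bool(starts.search(other))
print(both, other_starts)
```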
