I have a bunch of sentences which I need to parse and convert to corresponding regex search code. Examples of my sentences -
LINE_CONTAINS phrase one BEFORE {phrase2 AND phrase3} AND LINE_STARTSWITH Therefore we
-This means that in the line, phrase one comes somewhere before phrase2 and phrase3. Also, the line must start with Therefore we
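For illustration, a hand-written regex for this first rule might look like the sketch below. It is only one possible reading of the rule: it assumes BEFORE means "some occurrence of the left phrase precedes some occurrence of the right phrase" and that phrases match on word boundaries. The names rule1 and matches_rule1 are made up for this example.

```python
import re

# Hand-written target for:
#   LINE_CONTAINS phrase one BEFORE {phrase2 AND phrase3} AND LINE_STARTSWITH Therefore we
# Each lookahead checks one condition, starting from the beginning of the line.
rule1 = re.compile(
    r"^(?=Therefore we\b)"                  # LINE_STARTSWITH Therefore we
    r"(?=.*\bphrase one\b.*\bphrase2\b)"    # phrase one occurs before phrase2
    r"(?=.*\bphrase one\b.*\bphrase3\b)"    # phrase one occurs before phrase3
)

def matches_rule1(line):
    """Return True if the (single) line satisfies all three conditions."""
    return rule1.search(line) is not None

print(matches_rule1("Therefore we put phrase one ahead of phrase3 and phrase2"))
print(matches_rule1("Therefore we put phrase2 ahead of phrase one and phrase3"))
```

Using separate lookaheads keeps the conditions independent of each other, so AND-ed sub-rules can simply be concatenated.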
LINE_CONTAINS abc {upto 4 words} xyz {upto 3 words} pqr
-This means I need to allow up to 4 words between the first 2 phrases and up to 3 words between the last 2 phrases
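As a rough sketch of the regex this second rule could turn into: the helper below builds a "gap" fragment that allows between 0 and n intervening words. It assumes a word is any run of non-whitespace characters; upto_words_gap and pattern are names invented for this example.

```python
import re

def upto_words_gap(n):
    """Regex fragment allowing 0..n whitespace-separated words between two phrases."""
    return r"(?:\s+\S+){0,%d}\s+" % n

# LINE_CONTAINS abc {upto 4 words} xyz {upto 3 words} pqr
pattern = re.compile(
    r"\babc" + upto_words_gap(4) + r"xyz" + upto_words_gap(3) + r"pqr\b"
)

print(bool(pattern.search("abc one two xyz pqr")))        # 2 words, then 0 words
print(bool(pattern.search("abc a b c d e xyz pqr")))      # 5 words in the first gap
```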
Using help from Paul McGuire (here), the following grammar was written -
import pprint
from pyparsing import (CaselessKeyword, Word, alphanums, nums, MatchFirst, quotedString,
                       infixNotation, Combine, opAssoc, Suppress, pyparsing_common, Group,
                       OneOrMore, ZeroOrMore)

# line directives
LINE_CONTAINS, LINE_STARTSWITH, LINE_ENDSWITH = map(CaselessKeyword,
    "LINE_CONTAINS LINE_STARTSWITH LINE_ENDSWITH".split())
# boolean and positional operators
NOT, AND, OR = map(CaselessKeyword, "NOT AND OR".split())
BEFORE, AFTER, JOIN = map(CaselessKeyword, "BEFORE AFTER JOIN".split())
lpar = Suppress('{')
rpar = Suppress('}')
keyword = MatchFirst([LINE_CONTAINS, LINE_STARTSWITH, LINE_ENDSWITH, NOT, AND, OR,
                      BEFORE, AFTER, JOIN])  # declaring all keywords and assigning order for all further use
phrase_word = ~keyword + Word(alphanums + '_')  # any word that is not a keyword
upto_N_words = Group(lpar + 'upto' + pyparsing_common.integer('numberofwords') + 'words' + rpar)
phrase_term = Group(OneOrMore(phrase_word) + ZeroOrMore(upto_N_words + OneOrMore(phrase_word)))
phrase_expr = infixNotation(phrase_term,
    [
        ((BEFORE | AFTER | JOIN), 2, opAssoc.LEFT,),  # (opExpr, numTerms, rightLeftAssoc, parseAction)
        (NOT, 1, opAssoc.RIGHT,),
        (AND, 2, opAssoc.LEFT,),
        (OR, 2, opAssoc.LEFT),
    ],
    lpar=Suppress('{'), rpar=Suppress('}')
)  # structure of a single phrase with its operators
line_term = Group((LINE_CONTAINS | LINE_STARTSWITH | LINE_ENDSWITH)("line_directive") +
                  Group(phrase_expr)("phrase"))  # basically giving structure to a single sub-rule having line-term and phrase
line_contents_expr = infixNotation(line_term,
    [
        (NOT, 1, opAssoc.RIGHT,),
        (AND, 2, opAssoc.LEFT,),
        (OR, 2, opAssoc.LEFT),
    ]
)  # grammar for the entire rule/sentence
sample1 = """
LINE_CONTAINS phrase one BEFORE {phrase2 AND phrase3} AND LINE_STARTSWITH Therefore we
"""
sample2 = """
LINE_CONTAINS abcd {upto 4 words} xyzw {upto 3 words} pqrs BEFORE something else
"""
My question now is - How do I access the parsed elements in order to convert the sentences to my regex code? For this, I tried the following -
parsed = line_contents_expr.parseString(sample1)  # and likewise for sample2
print(parsed[0].asDict())
print(parsed)
pprint.pprint(parsed)
The result of the above code for sample1 was -
{}
[[['LINE_CONTAINS', [[['sentence', 'one'], 'BEFORE', [['sentence2'], 'AND', ['sentence3']]]]], 'AND', ['LINE_STARTSWITH', [['Therefore', 'we']]]]]
([([(['LINE_CONTAINS', ([([(['sentence', 'one'], {}), 'BEFORE', ([(['sentence2'], {}), 'AND', (['sentence3'], {})], {})], {})], {})], {'phrase': [(([([(['sentence', 'one'], {}), 'BEFORE', ([(['sentence2'], {}), 'AND', (['sentence3'], {})], {})], {})], {}), 1)], 'line_directive': [('LINE_CONTAINS', 0)]}), 'AND', (['LINE_STARTSWITH', ([(['Therefore', 'we'], {})], {})], {'phrase': [(([(['Therefore', 'we'], {})], {}), 1)], 'line_directive': [('LINE_STARTSWITH', 0)]})], {})], {})
The result of the above code for sample2 was -
{'phrase': [[['abcd', {'numberofwords': 4}, 'xyzw', {'numberofwords': 3}, 'pqrs'], 'BEFORE', ['something', 'else']]], 'line_directive': 'LINE_CONTAINS'}
[['LINE_CONTAINS', [[['abcd', ['upto', 4, 'words'], 'xyzw', ['upto', 3, 'words'], 'pqrs'], 'BEFORE', ['something', 'else']]]]]
([(['LINE_CONTAINS', ([([(['abcd', (['upto', 4, 'words'], {'numberofwords': [(4, 1)]}), 'xyzw', (['upto', 3, 'words'], {'numberofwords': [(3, 1)]}), 'pqrs'], {}), 'BEFORE', (['something', 'else'], {})], {})], {})], {'phrase': [(([([(['abcd', (['upto', 4, 'words'], {'numberofwords': [(4, 1)]}), 'xyzw', (['upto', 3, 'words'], {'numberofwords': [(3, 1)]}), 'pqrs'], {}), 'BEFORE', (['something', 'else'], {})], {})], {}), 1)], 'line_directive': [('LINE_CONTAINS', 0)]})], {})
My questions based on the above output are -
- Why does the pprint (pretty print) have more detailed parsing than normal print?
- Why does the asDict() method give no output for sample1 but does for sample2?
- Whenever I try to access the parsed elements using print (parsed.numberofwords) or parsed.line_directive or parsed.line_term, it gives me nothing. How can I access these elements in order to use them to build my regex codes?
To answer your printing questions:
1) pprint is there to pretty-print a nested list of tokens, without showing any results names. It is essentially a wrapper for calling pprint.pprint(results.asList()).
2) asDict() is there to convert your parsed results to an actual Python dict, so it only shows the results names (with nesting if you have names within names).
To view the contents of your parsed output, you are best off using print(result.dump()). dump() shows both the nesting of the results and any named results along the way.
I also recommend using expr.runTests to give you dump() output as well as any exceptions and exception locators. With your code, you could most easily do this by passing your samples to line_contents_expr.runTests.
But I also suggest you step back a second and think about just what this {upto n words} business is all about. Look at your samples and draw rectangles around the line terms, and then within the line terms draw circles around the phrase terms. (This would be a good exercise in leading up to writing for yourself a BNF description of this grammar, which I always recommend as a getting-your-head-around-the-problem step.) What if you treated the upto expressions as another operator? To see this, change phrase_term back to the way you had it, without the upto_N_words part, and then change the first precedence entry in your phrase expression so that upto_N_words is an operator alongside BEFORE, AFTER, and JOIN.
Or give some thought to maybe having the upto operator at a higher or lower precedence than BEFORE, AFTER, and JOIN, and adjust the precedence list accordingly.
With this change, I get this output from calling runTests on your samples:
You can iterate over these results and pick them apart, but you are rapidly reaching the point where you should look at building executable nodes from the different precedence levels - see the SimpleBool.py example on the pyparsing wiki for how to do this.
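To make "iterate over these results and pick them apart" concrete, here is a minimal sketch that walks a plain nested list and emits a regex. It assumes the nested-list shape printed in the question for sample2 (what parsed.asList() would give with the original grammar, where the upto groups sit inside the phrase term), handles only the BEFORE operator between exactly two operands, and uses function names made up for this example.

```python
import re

# Nested list as printed in the question for sample2
parsed_sample2 = [['LINE_CONTAINS',
                   [[['abcd', ['upto', 4, 'words'], 'xyzw', ['upto', 3, 'words'], 'pqrs'],
                     'BEFORE',
                     ['something', 'else']]]]]

def phrase_term_to_regex(term):
    """Words become escaped literals; ['upto', N, 'words'] becomes a 0..N word gap."""
    parts = []
    prev_was_word = False
    for tok in term:
        if isinstance(tok, list):              # ['upto', N, 'words']
            parts.append(r"(?:\s+\S+){0,%d}\s+" % tok[1])
            prev_was_word = False
        else:
            if prev_was_word:                  # adjacent words are separated by whitespace
                parts.append(r"\s+")
            parts.append(re.escape(tok))
            prev_was_word = True
    return "".join(parts)

def phrase_expr_to_regex(expr):
    # A phrase term starts with a plain word; an operator group starts with a nested list.
    if isinstance(expr[0], str):
        return phrase_term_to_regex(expr)
    left, op, right = expr                     # only [lhs, op, rhs] groups are handled here
    if op == "BEFORE":
        return phrase_expr_to_regex(left) + ".*" + phrase_expr_to_regex(right)
    raise NotImplementedError(op)

def line_term_to_regex(line_term):
    directive, phrase = line_term
    body = phrase_expr_to_regex(phrase[0])
    if directive == "LINE_STARTSWITH":
        return "^" + body
    if directive == "LINE_ENDSWITH":
        return body + "$"
    return body                                # LINE_CONTAINS

regex = line_term_to_regex(parsed_sample2[0])
print(regex)
```

This kind of type-checking walk works for small cases, but every new operator adds another branch, which is the reason for moving to node classes.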
EDIT: Please review this pared-down version of a parser for phrase_expr, and how it creates Node instances that themselves generate the output. See how numberofwords is accessed on the operator in the UpToNode class. See how "xyz abc" gets interpreted as "xyz AND abc" with an implicit AND operator.
Expand on this to support your LINE_xxx expressions.
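The node idea can be sketched without the parser at all. The classes below are a hypothetical illustration, not the code referred to above: each node stores its operands and knows how to generate its own regex fragment. The trees are built by hand here; in a pyparsing grammar, classes like these would be attached as the parse action of each infixNotation precedence level. The AND semantics (both operands present, in any order) and the definition of a word as a run of non-whitespace are assumptions.

```python
import re

class Node:
    """Base class: every node generates its own regex fragment."""
    def generate(self):
        raise NotImplementedError

class LiteralNode(Node):
    def __init__(self, word):
        self.word = word
    def generate(self):
        return re.escape(self.word)

class UpToNode(Node):
    """left {upto N words} right"""
    def __init__(self, left, numberofwords, right):
        self.left, self.numberofwords, self.right = left, numberofwords, right
    def generate(self):
        gap = r"(?:\s+\S+){0,%d}\s+" % self.numberofwords
        return self.left.generate() + gap + self.right.generate()

class AndNode(Node):
    """All operands must occur somewhere in the line, in any order."""
    def __init__(self, *operands):
        self.operands = operands
    def generate(self):
        # one lookahead per operand, so order in the line does not matter
        return "".join("(?=.*%s)" % op.generate() for op in self.operands)

# abc {upto 4 words} xyz AND pqr
tree = AndNode(UpToNode(LiteralNode("abc"), 4, LiteralNode("xyz")),
               LiteralNode("pqr"))
regex = tree.generate()
print(regex)
```

Supporting the LINE_xxx directives would then mean adding one more node type that wraps a phrase node and anchors its output with ^ or $ as needed.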