How best to parse a simple grammar?

2019-01-21 06:24发布

问题:

Ok, so I've asked a bunch of smaller questions about this project, but I still don't have much confidence in the designs I'm coming up with, so I'm going to ask a question on a broader scale.

I am parsing pre-requisite descriptions for a course catalog. The descriptions almost always follow a certain form, which makes me think I can parse most of them.

From the text, I would like to generate a graph of course pre-requisite relationships. (That part will be easy, after I have parsed the data.)

Some sample inputs and outputs:

"CS 2110" => ("CS", 2110) # 0

"CS 2110 and INFO 3300" => [("CS", 2110), ("INFO", 3300)] # 1
"CS 2110, INFO 3300" => [("CS", 2110), ("INFO", 3300)] # 1
"CS 2110, 3300, 3140" => [("CS", 2110), ("CS", 3300), ("CS", 3140)] # 1

"CS 2110 or INFO 3300" => [[("CS", 2110)], [("INFO", 3300)]] # 2

"MATH 2210, 2230, 2310, or 2940" => [[("MATH", 2210), ("MATH", 2230), ("MATH", 2310)], [("MATH", 2940)]] # 3  
  1. If the entire description is just a course, it is output directly.

  2. If the courses are conjoined ("and"), they are all output in the same list

  3. If the course are disjoined ("or"), they are in separate lists

  4. Here, we have both "and" and "or".

One caveat that makes it easier: it appears that the nesting of "and"/"or" phrases is never greater than as shown in example 3.

What is the best way to do this? I started with PLY, but I couldn't figure out how to resolve the reduce/reduce conflicts. The advantage of PLY is that it's easy to manipulate what each parse rule generates:

def p_course(p):
 'course : DEPT_CODE COURSE_NUMBER'
 p[0] = (p[1], int(p[2]))

With PyParse, it's less clear how to modify the output of parseString(). I was considering building upon @Alex Martelli's idea of keeping state in an object and building up the output from that, but I'm not sure exactly how that is best done.

 def addCourse(self, str, location, tokens):
  self.result.append((tokens[0][0], tokens[0][1]))

 def makeCourseList(self, str, location, tokens):

  dept = tokens[0][0]
  new_tokens = [(dept, tokens[0][1])]
  new_tokens.extend((dept, tok) for tok in tokens[1:])

  self.result.append(new_tokens)

For instance, to handle "or" cases:

    def __init__(self):
            self.result = []
            # ...
  self.statement = (course_data + Optional(OR_CONJ + course_data)).setParseAction(self.disjunctionCourses)



 def disjunctionCourses(self, str, location, tokens):
  if len(tokens) == 1:
   return tokens

  print "disjunction tokens: %s" % tokens

How does disjunctionCourses() know which smaller phrases to disjoin? All it gets is tokens, but what's been parsed so far is stored in result, so how can the function tell which data in result corresponds to which elements of token? I guess I could search through the tokens, then find an element of result with the same data, but that feel convoluted...

Also, there are many descriptions that include misc text, like:

"CS 2110 or permission of instructor"
"INFO 3140 or equivalent experience"
"PYSCH 2210 and sophomore standing"

But it isn't critical that I parse that text.

What's a better way to approach this problem?

回答1:

def parse(astr):
    astr=astr.replace(',','')
    astr=astr.replace('and','')    
    tokens=astr.split()
    dept=None
    number=None
    result=[]
    option=[]
    for tok in tokens:
        if tok=='or':
            result.append(option)
            option=[]
            continue
        if tok.isalpha():
            dept=tok
            number=None
        else:
            number=int(tok)
        if dept and number:
            option.append((dept,number))
    else:
        if option:
            result.append(option)
    return result

if __name__=='__main__':
    tests=[ ("CS 2110" , [[("CS", 2110)]]),
            ("CS 2110 and INFO 3300" , [[("CS", 2110), ("INFO", 3300)]]),
            ("CS 2110, INFO 3300" , [[("CS", 2110), ("INFO", 3300)]]),
            ("CS 2110, 3300, 3140", [[("CS", 2110), ("CS", 3300), ("CS", 3140)]]),
            ("CS 2110 or INFO 3300", [[("CS", 2110)], [("INFO", 3300)]]),
            ("MATH 2210, 2230, 2310, or 2940", [[("MATH", 2210), ("MATH", 2230), ("MATH", 2310)], [("MATH", 2940)]])]

    for test,answer in tests:
        result=parse(test)
        if result==answer:
            print('GOOD: {0} => {1}'.format(test,answer))
        else:
            print('ERROR: {0} => {1} != {2}'.format(test,result,answer))
            break

yields

GOOD: CS 2110 => [[('CS', 2110)]]
GOOD: CS 2110 and INFO 3300 => [[('CS', 2110), ('INFO', 3300)]]
GOOD: CS 2110, INFO 3300 => [[('CS', 2110), ('INFO', 3300)]]
GOOD: CS 2110, 3300, 3140 => [[('CS', 2110), ('CS', 3300), ('CS', 3140)]]
GOOD: CS 2110 or INFO 3300 => [[('CS', 2110)], [('INFO', 3300)]]
GOOD: MATH 2210, 2230, 2310, or 2940 => [[('MATH', 2210), ('MATH', 2230), ('MATH', 2310)], [('MATH', 2940)]]


回答2:

For simple grammars I really like Parsing Expression Grammars (PEGs), which amount to a disciplined, structured way of writing a recursive-descent parser. In a dynamically typed language like Python you can do useful things without having a separate "parser generator". That means no nonsense with reduce-reduce conflicts or other arcana of LR parsing.

I did a little searching and pyPEG appears to be a nice library for Python.



回答3:

If you get reduce/reduce conflicts you need to specify the precedence of "or" and "and". Im guessing "and" binds tightest, meaning "CS 101 and CS 102 or CS 201" means [[CS 101, CS 102] [CS 201]].

If you can find examples of both then the grammar is ambigous and you are out of luck. However you might be able to let this ambiguity be left underspecified, all depending on what you are going to do with the results.

PS, Looks like the language is regular, you could consider a DFA.



回答4:

I don't pretend to know much about parsing a grammar, and for your case the solution by unutbu is all you'll need. But I learnt a fair bit about parsing from Eric Lippert in his recent series of blog posts.

http://blogs.msdn.com/b/ericlippert/archive/2010/04/26/every-program-there-is-part-one.aspx

It's a 7 part series that goes through creating and parsing a grammar, then optimizing the grammar to make parsing easier and more performant. He produces C# code to generate all combinations of particular grammars, but it shouldn't be too much of a stretch to convert that into python to parse a fairly simple grammar of your own.