Using pyparsing to convert one query format to another

I am at a loss. I have been trying to get this to work for days now. But I am not getting anywhere with this, so I figured I'd consult you guys here and see if someone is able to help me!

I am using pyparsing in an attempt to convert one query format to another. This is not a simple transformation; it actually takes some brains :)

The current query is the following:

("breast neoplasms"[MeSH Terms] OR breast cancer[Acknowledgments] 
OR breast cancer[Figure/Table Caption] OR breast cancer[Section Title] 
OR breast cancer[Body - All Words] OR breast cancer[Title] 
OR breast cancer[Abstract] OR breast cancer[Journal]) 
AND (prevention[Acknowledgments] OR prevention[Figure/Table Caption] 
OR prevention[Section Title] OR prevention[Body - All Words] 
OR prevention[Title] OR prevention[Abstract])

And using pyparsing I have been able to get the following structure:

[[[['"', 'breast', 'neoplasms', '"'], ['MeSH', 'Terms']], 'or',
[['breast', 'cancer'], ['Acknowledgments']], 'or', [['breast', 'cancer'],
['Figure/Table', 'Caption']], 'or', [['breast', 'cancer'], ['Section', 
'Title']], 'or', [['breast', 'cancer'], ['Body', '-', 'All', 'Words']], 
'or', [['breast', 'cancer'], ['Title']], 'or', [['breast', 'cancer'], 
['Abstract']], 'or', [['breast', 'cancer'], ['Journal']]], 'and', 
[[['prevention'], ['Acknowledgments']], 'or', [['prevention'], 
['Figure/Table', 'Caption']], 'or', [['prevention'], ['Section', 'Title']], 
'or', [['prevention'], ['Body', '-', 'All', 'Words']], 'or', 
[['prevention'], ['Title']], 'or', [['prevention'], ['Abstract']]]]

But now I am at a loss. I need to convert the above output into a Lucene search query. Here is a short example of the transformations required:

"breast neoplasms"[MeSH Terms] --> [['"', 'breast', 'neoplasms', '"'], 
['MeSH', 'Terms']] --> mesh terms: "breast neoplasms"

But I am stuck right there. I also need to be able to make use of the special words AND and OR.

So a final query might be: mesh terms: "breast neoplasms" and prevention

Who can help me and give me some hints on how to solve this? Any kind of help would be appreciated.

Since I am using pyparsing, I am bound to Python. I have pasted the code below so that you can play around with it and don't have to start from zero!

Thanks so much for the help!

from pyparsing import *

def PubMedQueryParser():
    word = Word(alphanums +".-/&§")
    complex_structure = Group(Literal('"') + OneOrMore(word) + Literal('"')) + Suppress('[') + Group(OneOrMore(word)) + Suppress(']')
    medium_structure = Group(OneOrMore(word)) + Suppress('[') + Group(OneOrMore(word)) + Suppress(']')
    easy_structure = Group(OneOrMore(word))
    parse_structure = complex_structure | medium_structure | easy_structure
    operators = oneOf("and or", caseless=True)
    expr = Forward()
    atom = Group(parse_structure) + ZeroOrMore(operators + expr)
    atom2 = Group(Suppress('(') + atom + Suppress(')')) + ZeroOrMore(operators + expr) | atom
    expr << atom2
    return expr

Well, you have gotten yourself off to a decent start. But from here, it is easy to get bogged down in details of parser-tweaking, and you could be in that mode for days. Let's step through your problem beginning with the original query syntax.

When you start out with a project like this, write a BNF of the syntax you want to parse. It doesn't have to be super rigorous; in fact, here is a start at one based on what I can see from your sample:

word :: Word('a'-'z', 'A'-'Z', '0'-'9', '.-/&§')
field_qualifier :: '[' word+ ']'
search_term :: (word+ | quoted_string) field_qualifier?
and_op :: 'and'
or_op :: 'or'
and_term :: or_term (and_op or_term)*
or_term :: atom (or_op atom)*
atom :: search_term | ('(' and_term ')')

That's pretty close - we have a slight problem with some possible ambiguity between word and the and_op and or_op expressions, since 'and' and 'or' do match the definition of a word. We'll need to tighten this up at implementation time, to make sure that "cancer or carcinoma or lymphoma or melanoma" gets read as 4 different search terms separated by 'or's, not just one big term (which I think is what your current parser would do). We also get the benefit of recognizing precedence of operators - maybe not strictly necessary, but let's go with it for now.

Converting to pyparsing is simple enough:

LBRACK,RBRACK,LPAREN,RPAREN = map(Suppress,"[]()")
and_op = CaselessKeyword('and')
or_op = CaselessKeyword('or')
word = Word(alphanums + '.-/&')

field_qualifier = LBRACK + OneOrMore(word) + RBRACK
search_term = ((Group(OneOrMore(word)) | quotedString)('search_text') + 
               Optional(field_qualifier)('field'))
expr = Forward()
atom = search_term | (LPAREN + expr + RPAREN)
or_term = atom + ZeroOrMore(or_op + atom)
and_term = or_term + ZeroOrMore(and_op + or_term)
expr << and_term

To address the ambiguity of 'or' and 'and', we put a negative lookahead at the beginning of word:

word = ~(and_op | or_op) + Word(alphanums + '.-/&')
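
To see the effect of the lookahead in isolation, here is a small sketch comparing the two definitions of word. The names plain_word and guarded_word are made up for this comparison, and it assumes pyparsing is installed and exposes the camelCase method names used throughout this answer:

```python
from pyparsing import CaselessKeyword, OneOrMore, Word, alphanums

and_op = CaselessKeyword('and')
or_op = CaselessKeyword('or')

# hypothetical names, used only for this side-by-side comparison
plain_word = Word(alphanums + '.-/&')
guarded_word = ~(and_op | or_op) + Word(alphanums + '.-/&')

text = "cancer or carcinoma"

# without the lookahead, 'or' is swallowed as just another word
greedy = OneOrMore(plain_word).parseString(text).asList()
print(greedy)    # ['cancer', 'or', 'carcinoma']

# with the lookahead, the run of words stops in front of the operator
guarded = OneOrMore(guarded_word).parseString(text).asList()
print(guarded)   # ['cancer']

# keywords only match whole words, so words that merely start with
# 'or' or 'and' are still accepted as ordinary words
embedded = OneOrMore(guarded_word).parseString("organ android").asList()
print(embedded)  # ['organ', 'android']
```

The last case is the reason for using CaselessKeyword rather than a plain caseless literal for the operators.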

To give some structure to the results, wrap in Group classes:

field_qualifier = Group(LBRACK + OneOrMore(word) + RBRACK)
search_term = Group(Group(OneOrMore(word) | quotedString)('search_text') +
                          Optional(field_qualifier)('field'))
expr = Forward()
atom = search_term | (LPAREN + expr + RPAREN)
or_term = Group(atom + ZeroOrMore(or_op + atom))
and_term = Group(or_term + ZeroOrMore(and_op + or_term))
expr << and_term

Now parsing your sample text (stored in the string test) with:

res = expr.parseString(test)
from pprint import pprint
pprint(res.asList())

gives:

[[[[[[['"breast neoplasms"'], ['MeSH', 'Terms']],
     'or',
     [['breast', 'cancer'], ['Acknowledgments']],
     'or',
     [['breast', 'cancer'], ['Figure/Table', 'Caption']],
     'or',
     [['breast', 'cancer'], ['Section', 'Title']],
     'or',
     [['breast', 'cancer'], ['Body', '-', 'All', 'Words']],
     'or',
     [['breast', 'cancer'], ['Title']],
     'or',
     [['breast', 'cancer'], ['Abstract']],
     'or',
     [['breast', 'cancer'], ['Journal']]]]],
  'and',
  [[[[['prevention'], ['Acknowledgments']],
     'or',
     [['prevention'], ['Figure/Table', 'Caption']],
     'or',
     [['prevention'], ['Section', 'Title']],
     'or',
     [['prevention'], ['Body', '-', 'All', 'Words']],
     'or',
     [['prevention'], ['Title']],
     'or',
     [['prevention'], ['Abstract']]]]]]]

Actually, pretty similar to the results from your parser. We could now recurse through this structure and build up your new query string, but I prefer to do this using parsed objects, created at parse time by defining classes as token containers instead of Groups, and then adding behavior to the classes to get our desired output. The distinction is that our parsed object token containers can have behavior that is specific to the kind of expression that was parsed.
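
For comparison, the recursive route could look roughly like the sketch below. It uses only plain Python lists shaped like the asList() output above (shortened to three terms); is_term and to_query are hypothetical helper names, and the rule for telling a search term apart from a group is an assumption based on that shape:

```python
def is_term(node):
    # assumed shape of a search term: [[text words...]] or
    # [[text words...], [field words...]], i.e. a list of lists of strings
    return bool(node) and all(
        isinstance(part, list) and all(isinstance(w, str) for w in part)
        for part in node)

def to_query(node):
    if is_term(node):
        text = ' '.join(node[0])
        if len(node) > 1:
            field = ' '.join(w.lower() for w in node[1])
            return '%s: %s' % (field, text)
        return text
    # otherwise it is a group: operator strings mixed with nested lists
    parts = []
    for child in node:
        if isinstance(child, str):
            parts.append(child.upper())
        else:
            parts.append(to_query(child))
    if len(parts) == 1:
        return parts[0]          # a wrapper group with a single member
    return '(%s)' % ' '.join(parts)

t1 = [['"breast neoplasms"'], ['MeSH', 'Terms']]
t2 = [['breast', 'cancer'], ['Title']]
t3 = [['prevention'], ['Abstract']]
left = [[[t1, 'or', t2]]]        # extra nesting as produced by the Groups
right = [[[t3]]]
parsed = [[left, 'and', right]]

query = to_query(parsed)
print(query)
# ((mesh terms: "breast neoplasms" OR title: breast cancer) AND abstract: prevention)
```

This works, but the formatting code has to guess what each nested list means from its shape alone, which is the weakness the parsed-object approach avoids.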

We'll begin with a base abstract class, ParsedObject, that will take the parsed tokens as its initializing structure. We'll also add an abstract method, queryString, which we'll implement in all the deriving classes to create your desired output:

class ParsedObject(object):
    def __init__(self, tokens):
        self.tokens = tokens
    def queryString(self):
        '''Abstract method to be overridden in subclasses'''

Now we can derive from this class, and any subclass can be used as a parse action in defining the grammar.
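
As a minimal illustration of a class acting as a parse action, here is a sketch with a throwaway subclass. UpperWord is a hypothetical class invented for this demo and is not part of the query parser; the sketch assumes pyparsing is installed:

```python
from pyparsing import Word, alphas

class ParsedObject(object):
    def __init__(self, tokens):
        self.tokens = tokens
    def queryString(self):
        '''Abstract method to be overridden in subclasses'''

# hypothetical subclass, for illustration only
class UpperWord(ParsedObject):
    def queryString(self):
        return self.tokens[0].upper()

simple_word = Word(alphas)
# pyparsing calls the class with the matched tokens, and the new
# instance replaces those tokens in the parse results
simple_word.setParseAction(UpperWord)

obj = simple_word.parseString("prevention")[0]
print(type(obj).__name__)   # UpperWord
print(obj.queryString())    # PREVENTION
```

The parse result now holds an object that knows how to render itself, instead of a bare string.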

When we do this, Groups that were added for structure kind of get in our way, so we'll redefine the original parser without them:

search_term = (Group(OneOrMore(word) | quotedString)('search_text') + 
               Optional(field_qualifier)('field'))
atom = search_term | (LPAREN + expr + RPAREN)
or_term = atom + ZeroOrMore(or_op + atom)
and_term = or_term + ZeroOrMore(and_op + or_term)
expr << and_term

Now we implement the class for search_term, using self.tokens to access the parsed bits found in the input string:

class SearchTerm(ParsedObject):
    def queryString(self):
        text = ' '.join(self.tokens.search_text)
        if self.tokens.field:
            return '%s: %s' % (' '.join(f.lower() 
                                        for f in self.tokens.field[0]),text)
        else:
            return text
search_term.setParseAction(SearchTerm)

Next we'll implement the and_term and or_term expressions. Both are binary operators differing only in their resulting operator string in the output query, so we can just define one class and let them provide a class constant for their respective operator strings:

class BinaryOperation(ParsedObject):
    def queryString(self):
        joinstr = ' %s ' % self.op
        return joinstr.join(t.queryString() for t in self.tokens[0::2])
class OrOperation(BinaryOperation):
    op = "OR"
class AndOperation(BinaryOperation):
    op = "AND"
or_term.setParseAction(OrOperation)
and_term.setParseAction(AndOperation)

Note that pyparsing is a little different from traditional parsers - our BinaryOperation will match "a or b or c" as a single expression, not as the nested pairs "(a or b) or c". So we have to rejoin all of the terms using the stepping slice [0::2].
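
The stepping slice can be tried on a plain list. In this sketch the strings stand in for the already-converted operands that the parse action receives; the sample values are made up:

```python
# flat token list as seen by the parse action for "a or b or c":
# operands at even positions, operator strings at odd positions
tokens = ['title: breast cancer', 'or',
          'abstract: breast cancer', 'or',
          'journal: breast cancer']

operands = tokens[0::2]     # every second item, starting at index 0
operators = tokens[1::2]    # the operator strings that get skipped

joined = ' OR '.join(operands)
print(operands)
print(joined)
# title: breast cancer OR abstract: breast cancer OR journal: breast cancer
```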

Finally, we add a parse action to reflect any nesting by wrapping all exprs in ()'s:

class Expr(ParsedObject):
    def queryString(self):
        return '(%s)' % self.tokens[0].queryString()
expr.setParseAction(Expr)

For your convenience, here is the entire parser in one copy/pastable block:

from pyparsing import *

LBRACK,RBRACK,LPAREN,RPAREN = map(Suppress,"[]()")
and_op = CaselessKeyword('and')
or_op = CaselessKeyword('or')
word = ~(and_op | or_op) + Word(alphanums + '.-/&')
field_qualifier = Group(LBRACK + OneOrMore(word) + RBRACK)

search_term = (Group(OneOrMore(word) | quotedString)('search_text') + 
               Optional(field_qualifier)('field'))
expr = Forward()
atom = search_term | (LPAREN + expr + RPAREN)
or_term = atom + ZeroOrMore(or_op + atom)
and_term = or_term + ZeroOrMore(and_op + or_term)
expr << and_term

# define classes for parsed structure
class ParsedObject(object):
    def __init__(self, tokens):
        self.tokens = tokens
    def queryString(self):
        '''Abstract method to be overridden in subclasses'''

class SearchTerm(ParsedObject):
    def queryString(self):
        text = ' '.join(self.tokens.search_text)
        if self.tokens.field:
            return '%s: %s' % (' '.join(f.lower() 
                                        for f in self.tokens.field[0]),text)
        else:
            return text
search_term.setParseAction(SearchTerm)

class BinaryOperation(ParsedObject):
    def queryString(self):
        joinstr = ' %s ' % self.op
        return joinstr.join(t.queryString() 
                                for t in self.tokens[0::2])
class OrOperation(BinaryOperation):
    op = "OR"
class AndOperation(BinaryOperation):
    op = "AND"
or_term.setParseAction(OrOperation)
and_term.setParseAction(AndOperation)

class Expr(ParsedObject):
    def queryString(self):
        return '(%s)' % self.tokens[0].queryString()
expr.setParseAction(Expr)


test = """("breast neoplasms"[MeSH Terms] OR breast cancer[Acknowledgments]  
OR breast cancer[Figure/Table Caption] OR breast cancer[Section Title]  
OR breast cancer[Body - All Words] OR breast cancer[Title]  
OR breast cancer[Abstract] OR breast cancer[Journal])  
AND (prevention[Acknowledgments] OR prevention[Figure/Table Caption]  
OR prevention[Section Title] OR prevention[Body - All Words]  
OR prevention[Title] OR prevention[Abstract])"""

res = expr.parseString(test)[0]
print(res.queryString())

Which prints the following (wrapped here for readability; the actual output is a single line):

((mesh terms: "breast neoplasms" OR acknowledgments: breast cancer OR 
  figure/table caption: breast cancer OR section title: breast cancer OR 
  body - all words: breast cancer OR title: breast cancer OR 
  abstract: breast cancer OR journal: breast cancer) AND 
 (acknowledgments: prevention OR figure/table caption: prevention OR 
  section title: prevention OR body - all words: prevention OR 
  title: prevention OR abstract: prevention))

I'm guessing you'll need to tighten up some of this output - those Lucene tag names look very ambiguous - I was just following your posted sample. But you shouldn't have to change the parser much; just adjust the queryString methods of the attached classes.
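
One possible way to tighten the output is to normalize the field names and quote multi-word text inside the queryString methods. This is only a sketch: lucene_field and lucene_term are hypothetical helpers, the underscore naming is an assumption about your index schema, and quoting turns the text into a phrase query, which may or may not be what you want. As far as I know, in the classic Lucene query syntax a field prefix applies only to the term directly after it, which is why unquoted multi-word text after a field name is ambiguous:

```python
import re

def lucene_field(words):
    # hypothetical naming rule: lowercase, and collapse anything that is
    # not a letter or digit into a single underscore
    name = ' '.join(w.lower() for w in words)
    return re.sub(r'[^a-z0-9]+', '_', name).strip('_')

def lucene_term(field_words, text):
    # assumption: multi-word text should be searched as a phrase
    if ' ' in text and not text.startswith('"'):
        text = '"%s"' % text
    return '%s:%s' % (lucene_field(field_words), text)

print(lucene_term(['MeSH', 'Terms'], '"breast neoplasms"'))
# mesh_terms:"breast neoplasms"
print(lucene_term(['Body', '-', 'All', 'Words'], 'breast cancer'))
# body_all_words:"breast cancer"
print(lucene_term(['Figure/Table', 'Caption'], 'prevention'))
# figure_table_caption:prevention
```

Something like lucene_term could be called from SearchTerm.queryString in place of the '%s: %s' formatting, leaving the grammar itself untouched.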

As an added exercise for the poster: add support for a NOT boolean operator in your query language.