I am at a loss. I have been trying to get this to work for days now. But I am not getting anywhere with this, so I figured I'd consult you guys here and see if someone is able to help me!
I am using pyparsing in an attempt to parse one query format to another one. This is not a simple transformation but actually takes some brains :)
The current query is the following:
("breast neoplasms"[MeSH Terms] OR breast cancer[Acknowledgments]
OR breast cancer[Figure/Table Caption] OR breast cancer[Section Title]
OR breast cancer[Body - All Words] OR breast cancer[Title]
OR breast cancer[Abstract] OR breast cancer[Journal])
AND (prevention[Acknowledgments] OR prevention[Figure/Table Caption]
OR prevention[Section Title] OR prevention[Body - All Words]
OR prevention[Title] OR prevention[Abstract])
And using pyparsing I have been able to get the following structure:
[[[['"', 'breast', 'neoplasms', '"'], ['MeSH', 'Terms']], 'or',
[['breast', 'cancer'], ['Acknowledgments']], 'or', [['breast', 'cancer'],
['Figure/Table', 'Caption']], 'or', [['breast', 'cancer'], ['Section',
'Title']], 'or', [['breast', 'cancer'], ['Body', '-', 'All', 'Words']],
'or', [['breast', 'cancer'], ['Title']], 'or', [['breast', 'cancer'],
['Abstract']], 'or', [['breast', 'cancer'], ['Journal']]], 'and',
[[['prevention'], ['Acknowledgments']], 'or', [['prevention'],
['Figure/Table', 'Caption']], 'or', [['prevention'], ['Section', 'Title']],
'or', [['prevention'], ['Body', '-', 'All', 'Words']], 'or',
[['prevention'], ['Title']], 'or', [['prevention'], ['Abstract']]]]
But now, I am at a loss. I need to format the above output as a Lucene search query. Here is a short example of the transformations required:
"breast neoplasms"[MeSH Terms] --> [['"', 'breast', 'neoplasms', '"'],
['MeSH', 'Terms']] --> mesh terms: "breast neoplasms"
But I am stuck right there. I also need to be able to make use of the special words AND and OR.
So a final query might be: mesh terms: "breast neoplasms" and prevention
Who can help me and give me some hints on how to solve this? Any kind of help would be appreciated.
Since I am using pyparsing, I am bound to Python. I have pasted the code below so that you can play around with it and don't have to start at 0!
Thanks so much for the help!
from pyparsing import (Forward, Group, Literal, OneOrMore, Suppress, Word,
                       ZeroOrMore, alphanums, oneOf)

def PubMedQueryParser():
    word = Word(alphanums + ".-/&§")
    # "quoted phrase"[Field Name]
    complex_structure = Group(Literal('"') + OneOrMore(word) + Literal('"')) + Suppress('[') + Group(OneOrMore(word)) + Suppress(']')
    # bare words[Field Name]
    medium_structure = Group(OneOrMore(word)) + Suppress('[') + Group(OneOrMore(word)) + Suppress(']')
    # bare words with no field
    easy_structure = Group(OneOrMore(word))
    parse_structure = complex_structure | medium_structure | easy_structure
    operators = oneOf("and or", caseless=True)
    expr = Forward()
    atom = Group(parse_structure) + ZeroOrMore(operators + expr)
    atom2 = Group(Suppress('(') + atom + Suppress(')')) + ZeroOrMore(operators + expr) | atom
    expr << atom2
    return expr
Well, you have gotten yourself off to a decent start. But from here, it is easy to get bogged down in details of parser-tweaking, and you could be in that mode for days. Let's step through your problem beginning with the original query syntax.
When you start out with a project like this, write a BNF of the syntax you want to parse. It doesn't have to be super rigorous, in fact, here is a start at one based on what I can see from your sample:
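A rough sketch of what such a BNF might look like, inferred from the sample query. The rule names word, search_term, and_op, or_op, and_term and or_term are the ones used in the rest of this answer; quoted_string, field_qualifier and atom are names assumed here for illustration, and so is the choice to let 'and' bind more tightly than 'or':

```
word            ::= one or more of: letters, digits, '.', '-', '/', '&', '§'
quoted_string   ::= '"' word+ '"'
field_qualifier ::= '[' word+ ']'
search_term     ::= (quoted_string | word+) field_qualifier?
and_op          ::= 'and'
or_op           ::= 'or'
atom            ::= search_term | '(' or_term ')'
and_term        ::= atom (and_op atom)*
or_term         ::= and_term (or_op and_term)*
```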
That's pretty close - we have a slight problem with some possible ambiguity between word and the and_op and or_op expressions, since 'and' and 'or' do match the definition of a word. We'll need to tighten this up at implementation time, to make sure that "cancer or carcinoma or lymphoma or melanoma" gets read as 4 different search terms separated by 'or's, not just one big term (which I think is what your current parser would do). We also get the benefit of recognizing precedence of operators - maybe not strictly necessary, but let's go with it for now.

Converting to pyparsing is simple enough:
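One way the conversion might look, as a minimal sketch rather than a definitive grammar. It assumes the camelCase pyparsing API used in the question; quoted_string, field_qualifier, atom and query are illustrative names, and the precedence ('and' tighter than 'or') is an assumption:

```python
from pyparsing import (CaselessKeyword, Forward, OneOrMore, Optional,
                       QuotedString, Suppress, Word, ZeroOrMore, alphanums)

# Keywords (not Literals), so only whole words count as operators.
and_op = CaselessKeyword("and")
or_op = CaselessKeyword("or")
LPAREN, RPAREN, LBRACK, RBRACK = map(Suppress, "()[]")

word = Word(alphanums + ".-/&§")
quoted_string = QuotedString('"')  # returns the phrase without its quotes
field_qualifier = LBRACK + OneOrMore(word) + RBRACK
search_term = (quoted_string | OneOrMore(word)) + Optional(field_qualifier)

# or_term is recursive (it can appear inside parentheses), hence Forward.
or_term = Forward()
atom = search_term | (LPAREN + or_term + RPAREN)
and_term = atom + ZeroOrMore(and_op + atom)
or_term <<= and_term + ZeroOrMore(or_op + and_term)

query = or_term  # hypothetical name for the top-level expression
print(query.parseString('"breast neoplasms"[MeSH Terms]', parseAll=True).asList())
```

At this stage word still happily matches 'and' and 'or', so the ambiguity mentioned above is not yet resolved.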
To address the ambiguity of 'or' and 'and', we put a negative lookahead at the beginning of word:
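A small sketch of that lookahead in isolation, assuming the same word characters as in the question. The `~` operator builds pyparsing's NotAny, which succeeds only when the wrapped expression does not match at the current position; the name words is just for this demonstration:

```python
from pyparsing import CaselessKeyword, OneOrMore, Word, alphanums

and_op = CaselessKeyword("and")
or_op = CaselessKeyword("or")

# A word is anything made of the allowed characters,
# unless it is exactly the keyword 'and' or 'or'.
word = ~(and_op | or_op) + Word(alphanums + ".-/&§")

words = OneOrMore(word)  # hypothetical helper expression
print(words.parseString("breast cancer or carcinoma").asList())
# ['breast', 'cancer']
```

Because the operators are keywords, ordinary words that merely start with 'or' or 'and' (such as "organ") are still accepted as words.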
To give some structure to the results, wrap in Group classes:

Now parsing your sample text with:
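A sketch of how the grouped grammar and the parse call might fit together. Where exactly to place the Group wrappers is a judgment call; the placement below is an assumption chosen to mirror the nesting in the question's output, and the variable names sample and results are illustrative:

```python
from pyparsing import (CaselessKeyword, Forward, Group, OneOrMore, Optional,
                       QuotedString, Suppress, Word, ZeroOrMore, alphanums)

and_op = CaselessKeyword("and")
or_op = CaselessKeyword("or")
LPAREN, RPAREN, LBRACK, RBRACK = map(Suppress, "()[]")

word = ~(and_op | or_op) + Word(alphanums + ".-/&§")
quoted_string = QuotedString('"')
# Group the field words, the search words, and each whole term.
field_qualifier = Group(LBRACK + OneOrMore(word) + RBRACK)
search_term = Group(Group(quoted_string | OneOrMore(word))
                    + Optional(field_qualifier))

or_term = Forward()
# Group parenthesized sub-expressions so nesting shows in the results.
atom = search_term | Group(LPAREN + or_term + RPAREN)
and_term = atom + ZeroOrMore(and_op + atom)
or_term <<= and_term + ZeroOrMore(or_op + and_term)

sample = '''("breast neoplasms"[MeSH Terms] OR breast cancer[Acknowledgments]
OR breast cancer[Figure/Table Caption] OR breast cancer[Section Title]
OR breast cancer[Body - All Words] OR breast cancer[Title]
OR breast cancer[Abstract] OR breast cancer[Journal])
AND (prevention[Acknowledgments] OR prevention[Figure/Table Caption]
OR prevention[Section Title] OR prevention[Body - All Words]
OR prevention[Title] OR prevention[Abstract])'''

results = or_term.parseString(sample, parseAll=True)
print(results.asList())
```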
gives:
Actually, pretty similar to the results from your parser. We could now recurse through this structure and build up your new query string, but I prefer to do this using parsed objects, created at parse time by defining classes as token containers instead of Groups, and then adding behavior to the classes to get our desired output. The distinction is that our parsed object token containers can have behavior that is specific to the kind of expression that was parsed.

We'll begin with a base abstract class, ParsedObject, that will take the parsed tokens as its initializing structure. We'll also add an abstract method, queryString, which we'll implement in all the deriving classes to create your desired output:

Now we can derive from this class, and any subclass can be used as a parse action in defining the grammar.
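A minimal sketch of such a base class. ParsedObject and queryString are the names used above; raising NotImplementedError is one assumed way of marking the method abstract, and EchoTerm is a hypothetical subclass that exists only to show the override pattern:

```python
class ParsedObject(object):
    """Base container for the tokens matched by one grammar expression."""

    def __init__(self, tokens):
        # When a class is used as a parse action, pyparsing calls it
        # with the matched tokens, so the constructor just keeps them.
        self.tokens = tokens

    def queryString(self):
        """Return the output-query text; subclasses must override this."""
        raise NotImplementedError


class EchoTerm(ParsedObject):
    # Hypothetical subclass: joins its tokens back into one string.
    def queryString(self):
        return " ".join(self.tokens)


print(EchoTerm(["breast", "cancer"]).queryString())  # breast cancer
```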
When we do this, Groups that were added for structure kind of get in our way, so we'll redefine the original parser without them:

Now we implement the class for search_term, using self.tokens to access the parsed bits found in the input string:

Next we'll implement the and_term and or_term expressions. Both are binary operators differing only in their resulting operator string in the output query, so we can just define one class and let them provide a class constant for their respective operator strings:

Note that pyparsing is a little different from traditional parsers - our BinaryOperation will match "a or b or c" as a single expression, not as the nested pairs "(a or b) or c". So we have to rejoin all of the terms using the stepping slice [0::2].

Finally, we add a parse action to reflect any nesting by wrapping all exprs in ()'s:
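The operator classes and the [0::2] rejoin can be tried out without any parser at all. The sketch below feeds a hand-built token list to a BinaryOperation subclass; Term and wrap_in_parens are hypothetical stand-ins (for a parsed search term and for the parenthesizing parse action), and the upper-case operator strings are an assumption about what the target query language expects:

```python
class Term(object):
    # Hypothetical stand-in for a parsed search term.
    def __init__(self, text):
        self.text = text

    def queryString(self):
        return self.text


class BinaryOperation(object):
    op = None  # each subclass supplies its operator string

    def __init__(self, tokens):
        self.tokens = tokens

    def queryString(self):
        # Tokens arrive flat, e.g. [a, 'or', b, 'or', c]: operands sit at
        # the even indexes, operator keywords at the odd ones.
        operands = self.tokens[0::2]
        return (" %s " % self.op).join(t.queryString() for t in operands)


class AndOperation(BinaryOperation):
    op = "AND"


class OrOperation(BinaryOperation):
    op = "OR"


def wrap_in_parens(text):
    # Hypothetical helper: what a parse action on a parenthesized
    # sub-expression would do to that sub-expression's query string.
    return "(" + text + ")"


tokens = [Term("a"), "or", Term("b"), "or", Term("c")]
print(wrap_in_parens(OrOperation(tokens).queryString()))  # (a OR b OR c)
```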
For your convenience, here is the entire parser in one copy/pastable block:
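A sketch of how the pieces described above might be assembled into one block. This is an illustration under assumptions, not a definitive implementation: ParenExpr, convert, quoted and plain are hypothetical names, the "field: text" output with a lower-cased field name simply follows the sample in the question, and the upper-case AND/OR strings are an assumption about the target syntax.

```python
from pyparsing import (CaselessKeyword, Forward, Group, OneOrMore, Optional,
                       QuotedString, Suppress, Word, ZeroOrMore, alphanums)


class ParsedObject(object):
    def __init__(self, tokens):
        self.tokens = tokens


class SearchTerm(ParsedObject):
    def queryString(self):
        # tokens are [text] or [text, [field words...]]
        text = self.tokens[0]
        if len(self.tokens) > 1:
            return "%s: %s" % (" ".join(self.tokens[1]).lower(), text)
        return text


class BinaryOperation(ParsedObject):
    def queryString(self):
        joiner = " %s " % self.op
        return joiner.join(t.queryString() for t in self.tokens[0::2])


class AndOperation(BinaryOperation):
    op = "AND"


class OrOperation(BinaryOperation):
    op = "OR"


class ParenExpr(ParsedObject):
    def queryString(self):
        return "(%s)" % self.tokens[0].queryString()


and_op = CaselessKeyword("and")
or_op = CaselessKeyword("or")
LPAREN, RPAREN, LBRACK, RBRACK = map(Suppress, "()[]")

word = ~(and_op | or_op) + Word(alphanums + ".-/&§")
# Put the quotes back on phrases; join bare words into one string.
quoted = QuotedString('"').setParseAction(lambda t: '"%s"' % t[0])
plain = OneOrMore(word).setParseAction(lambda t: " ".join(t))
field_qualifier = Group(LBRACK + OneOrMore(word) + RBRACK)
search_term = (quoted | plain) + Optional(field_qualifier)
search_term.setParseAction(SearchTerm)

or_term = Forward()
paren_expr = (LPAREN + or_term + RPAREN).setParseAction(ParenExpr)
atom = search_term | paren_expr
and_term = (atom + ZeroOrMore(and_op + atom)).setParseAction(AndOperation)
or_term <<= (and_term
             + ZeroOrMore(or_op + and_term)).setParseAction(OrOperation)


def convert(query):
    # The whole query parses to a single object; ask it for its string.
    return or_term.parseString(query, parseAll=True)[0].queryString()


print(convert('"breast neoplasms"[MeSH Terms] and prevention'))
```

For the short query in the print call this should produce `mesh terms: "breast neoplasms" AND prevention`; the full sample query from the question can be passed to convert in the same way.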
Which prints the following:
I'm guessing you'll need to tighten up some of this output - those Lucene tag names look very ambiguous - I was just following your posted sample. But you shouldn't have to change the parser much, just adjust the queryString methods of the attached classes.

As an added exercise to the poster: add support for a NOT boolean operator in your query language.