I am learning PyParsing in last few weeks. I plan to use it to get table names from SQL statements. I have looked at http://pyparsing.wikispaces.com/file/view/simpleSQL.py. But I intend to keep the grammar simple because I am not trying to get every part of select statement parsed rather I am looking for just table names. Also it is quite involved to define the complete grammar for any commercially available modern day database like Teradata.
#!/usr/bin/env python
from pyparsing import *
import sys
semicolon = Combine(Literal(';') + lineEnd)
comma = Literal(',')
lparen = Literal('(')
rparen = Literal(')')
# Keyword definition
update_kw, volatile_kw, create_kw, table_kw, as_kw, from_kw, \
where_kw, join_kw, left_kw, right_kw, cross_kw, outer_kw, \
on_kw , insert_kw , into_kw= \
map(lambda x: Keyword(x, caseless=True), \
['UPDATE', 'VOLATILE', 'CREATE', 'TABLE', 'AS', 'FROM',
'WHERE', 'JOIN' , 'LEFT', 'RIGHT' , \
'CROSS', 'OUTER', 'ON', 'INSERT', 'INTO'])
# Teradata SQL allows SELECT and well as SEL keyword
select_kw = Keyword('SELECT', caseless=True) | Keyword('SEL' , caseless=True)
# list of reserved keywords
reserved_words = (update_kw | volatile_kw | create_kw | table_kw | as_kw |
select_kw | from_kw | where_kw | join_kw |
left_kw | right_kw | cross_kw | on_kw | insert_kw |
into_kw)
# Identifier can be used as table or column names. They can't be reserved words
ident = ~reserved_words + Word(alphas, alphanums + '_')
# Recursive definition for table
table = Forward()
# simple table name can be identifer or qualified identifier e.g. schema.table
simple_table = Combine(Optional(ident + Literal('.')) + ident)
# table name can also a complete select statement used as table
nested_table = lparen.suppress() + select_kw.suppress() + SkipTo(from_kw).suppress() + \
from_kw.suppress() + table + rparen.suppress()
# table can be simple table or nested table
table << (nested_table | simple_table)
# comma delimited list of tables
table_list = delimitedList(table)
# Building from clause only because table name(s) will always appears after that
from_clause = from_kw.suppress() + table_list
txt = """
SELECT p, (SELECT * FROM foo),e FROM a, d, (SELECT * FROM z), b
"""
for token, start, end in from_clause.scanString(txt):
print token
A thing worth mentioning here. I use "SkipTo(from_kw)" to jump over column list in SQL statement. This is primarily to avoid defining grammar for column list which can be comma delimited list of identifiers, many function names, DW analytical functions and what not. With this grammar I am able to parse above statement as well as any level of nesting in SELECT column list or table list.
['foo']
['a', 'd', 'z', 'b']
I am facing problem when SELECT has where clause:
nested_table = lparen.suppress() + select_kw.suppress() + SkipTo(from_kw).suppress() + \
from_kw.suppress() + table + rparen.suppress()
When WHERE clause is there then the same statement may look like: SELECT ... FROM a,d , (SELECT * FROM z WHERE (c1 = 1) and (c2 = 3)), p I thought of changing "nested_table" definition to:
nested_table = lparen.suppress() + select_kw.suppress() + SkipTo(from_kw).suppress() + \
from_kw.suppress() + table + Optional(where_kw + SkipTo(rparen)) + rparen
But this is not working since it matches to the right parenthesis following "c = 1". What I would like to know is how to skip to the right parenthesis that matches left parenthesis right before "SELECT * FROM z..." I don't know how to do it using PyParsing
Also on a different note I seek some advice the best way to get table names from complex nested SQLs. Any help is really appreciated.
Thanks Abhijit
Considering that you are also trying to parse out nested SELECT's, I don't think you'll be able to avoid writing a fairly complete SQL parser. Fortunately, there is a more complete example on the Pyparsing wiki Examples page, select_parser.py. I hope that gets you further along.