SQL parsing using pyparsing

2019-03-31 03:16发布

I am learning PyParsing in last few weeks. I plan to use it to get table names from SQL statements. I have looked at http://pyparsing.wikispaces.com/file/view/simpleSQL.py. But I intend to keep the grammar simple because I am not trying to get every part of select statement parsed rather I am looking for just table names. Also it is quite involved to define the complete grammar for any commercially available modern day database like Teradata.

#!/usr/bin/env python

from pyparsing import *
import sys

semicolon = Combine(Literal(';') + lineEnd)
comma = Literal(',')
lparen = Literal('(')
rparen = Literal(')')

# Keyword definition
update_kw, volatile_kw, create_kw, table_kw, as_kw, from_kw, \
where_kw, join_kw, left_kw, right_kw, cross_kw, outer_kw, \
on_kw , insert_kw , into_kw= \
    map(lambda x: Keyword(x, caseless=True), \
        ['UPDATE', 'VOLATILE', 'CREATE', 'TABLE', 'AS', 'FROM',
         'WHERE', 'JOIN' , 'LEFT', 'RIGHT' , \
         'CROSS', 'OUTER', 'ON', 'INSERT', 'INTO'])

# Teradata SQL allows SELECT and well as SEL keyword
select_kw = Keyword('SELECT', caseless=True) | Keyword('SEL' , caseless=True)

# list of reserved keywords
reserved_words = (update_kw | volatile_kw | create_kw | table_kw | as_kw |
                  select_kw | from_kw | where_kw | join_kw |
                  left_kw | right_kw | cross_kw | on_kw | insert_kw |
                  into_kw)

# Identifier can be used as table or column names. They can't be reserved words
ident = ~reserved_words + Word(alphas, alphanums + '_')

# Recursive definition for table
table = Forward()
# simple table name can be identifer or qualified identifier e.g. schema.table
simple_table = Combine(Optional(ident + Literal('.')) + ident)
# table name can also a complete select statement used as table
nested_table = lparen.suppress() + select_kw.suppress() + SkipTo(from_kw).suppress() + \   
               from_kw.suppress() + table + rparen.suppress()
# table can be simple table or nested table
table << (nested_table | simple_table)
# comma delimited list of tables
table_list = delimitedList(table)
# Building from clause only because table name(s) will always appears after that
from_clause = from_kw.suppress() + table_list


txt = """
SELECT p, (SELECT * FROM foo),e FROM a, d, (SELECT * FROM z), b
"""
for token, start, end in from_clause.scanString(txt):
    print token

A thing worth mentioning here. I use "SkipTo(from_kw)" to jump over column list in SQL statement. This is primarily to avoid defining grammar for column list which can be comma delimited list of identifiers, many function names, DW analytical functions and what not. With this grammar I am able to parse above statement as well as any level of nesting in SELECT column list or table list.

['foo']
['a', 'd', 'z', 'b']

I am facing problem when SELECT has where clause:

nested_table = lparen.suppress() + select_kw.suppress() + SkipTo(from_kw).suppress() + \   
               from_kw.suppress() + table + rparen.suppress()

When WHERE clause is there then the same statement may look like: SELECT ... FROM a,d , (SELECT * FROM z WHERE (c1 = 1) and (c2 = 3)), p I thought of changing "nested_table" definition to:

nested_table = lparen.suppress() + select_kw.suppress() + SkipTo(from_kw).suppress() + \   
               from_kw.suppress() + table + Optional(where_kw + SkipTo(rparen)) + rparen

But this is not working since it matches to the right parenthesis following "c = 1". What I would like to know is how to skip to the right parenthesis that matches left parenthesis right before "SELECT * FROM z..." I don't know how to do it using PyParsing

Also on a different note I seek some advice the best way to get table names from complex nested SQLs. Any help is really appreciated.

Thanks Abhijit

1条回答
我想做一个坏孩纸
2楼-- · 2019-03-31 04:08

Considering that you are also trying to parse out nested SELECT's, I don't think you'll be able to avoid writing a fairly complete SQL parser. Fortunately, there is a more complete example on the Pyparsing wiki Examples page, select_parser.py. I hope that gets you further along.

查看更多
登录 后发表回答