In my work I come across a lot of gnarly sql, and I had the bright idea of writing a program to parse the sql and print it out neatly. I made most of it pretty quickly, but I ran into a problem that I don't know how to solve.
So let's pretend the sql is "select foo from bar where 1". My thought was that there is always a keyword followed by data for it, so all I have to do is parse a keyword, and then capture all gibberish before the next keyword and store that for later cleanup, if it is worthwhile. Here's the code:
import Text.Parsec
import Text.Parsec.Combinator
import Text.Parsec.Char
import Data.Text (strip)
newtype Statement = Statement [Atom]
data Atom = Branch String [Atom] | Leaf String deriving Show
trim str = reverse $ trim' (reverse $ trim' str)
where
trim' (' ':xs) = trim' xs
trim' str = str
printStatement atoms = mapM_ printAtom atoms
printAtom atom = loop 0 atom
where
loop depth (Leaf str) = putStrLn $ (replicate depth ' ') ++ str
loop depth (Branch str atoms) = do
putStrLn $ (replicate depth ' ') ++ str
mapM_ (loop (depth + 2)) atoms
keywords :: [String]
keywords = [
"select",
"update",
"delete",
"from",
"where"]
keywordparser :: Parsec String u String
keywordparser = try ((choice $ map string keywords) <?> "keywordparser")
stuffparser :: Parsec String u String
stuffparser = manyTill anyChar (eof <|> (lookAhead keywordparser >> return ()))
statementparser = do
key <- keywordparser
stuff <- stuffparser
return $ Branch key [Leaf (trim stuff)]
<?> "statementparser"
tp = parse (many statementparser) ""
The key here is the stuffparser. That is the stuff in between the keywords that could be anything from column lists to where criteria. This function catches all characters leading up to a keyword. But it needs something else before it is finished. What if there is a subselect? "select id,(select product from products) from bar". Well in that case if it hits that keyword, it screws everything up, parses it wrong and screws up my indenting. Also where clauses can have parenthesis as well.
So I need to change that anyChar into another combinator that slurps up characters one at a time but also tries to look for parenthesis, and if it finds them, traverse and capture all that, but also if there are more parenthesis, do that until we have fully closed the parenthesis, then concatenate it all and return it. Here's what I've tried, but I can't quite get it to work.
stuffparser :: Parsec String u String
stuffparser = fmap concat $ manyTill somechars (eof <|> (lookAhead keywordparser >> return ()))
where
somechars = parens <|> fmap (\c -> [c]) anyChar
parens= between (char '(') (char ')') somechars
This will error like so:
> tp "select asdf(qwerty) from foo where 1"
Left (line 1, column 14):
unexpected "w"
expecting ")"
But I can't think of any way to rewrite this so that it works. I've tried to use manyTill on the parenthesis part, but I end up having trouble getting it to typecheck when I have both string producing parens and single chars as alternatives. Does anyone have any suggestions on how to go about this?
Yeah,
between
might not work for what you're looking for. Of course, for your use case, I'd follow hammar's suggestion and grab an off-the-shelf SQL parser. (personal opinion: or, try not to use SQL unless you really have to; the idea to use strings for database queries was imho a historical mistake).Note: I add an operator called
<++>
which will concatenate the results of two parsers, whether they are strings or characters. (code at bottom.)First, for the task of parsing parenthesis: the top level will parse some stuff between the relevant characters, which is exactly what the code says,
Then, the
inner
function should parse anything else: non-parens, possibly including another set of parenthesis, and non-paren junk that follows.I'll make the assumption that for the rest of the solution, what you want to do is analgous to splitting things up by top-level SQL keywords. (i.e. ignoring those in parenthesis). Namely, we'll have a parser that will behave like so,
Suppose we have a
parseKw
parser that will get the likes ofselect
, etc. After we consume a keyword, we need to read until the next [top-level] keyword. The last trick to my solution is using thelookAhead
combinator to determine whether the next word is a keyword, and put it back if so. If it's not, then we consume a parenthesis or other character, and then recurse on the rest.My entire solution is as follows
edit - version with quote support
you can do the same thing as with the parens to support quotes,
I tried it with
parseTest parseSqlToplevel "select ('a(sdf'())b"
. cheers