I want to check whether a Spark SQL query is syntactically correct without actually running it on the cluster.
My actual use case: I am developing a user interface that lets a user enter a Spark SQL query, and I need to verify whether that query is syntactically correct.
It would also be great if, after parsing the query, I could give recommendations based on Spark best practices.
SparkSqlParser
Spark SQL uses SparkSqlParser as the parser for Spark SQL expressions.
You can access SparkSqlParser using SparkSession (and SessionState) as follows:
val spark: SparkSession = ...
val parser = spark.sessionState.sqlParser
scala> parser.parseExpression("select * from table")
res1: org.apache.spark.sql.catalyst.expressions.Expression = ('select * 'from) AS table#0
TIP: Enable INFO logging level for the org.apache.spark.sql.execution.SparkSqlParser logger to see what happens inside.
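For whole statements rather than single expressions, the same parser also exposes parsePlan, which is the call that SparkSession.sql reaches in the stack trace quoted in this answer. A minimal sketch of a syntax-only check built on it follows. SqlSyntaxCheck and syntaxError are hypothetical names made up for illustration, and sessionState is an unstable developer API, so treat this as an assumption-laden sketch rather than the recommended approach.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.parser.ParseException
import scala.util.{Failure, Success, Try}

// Hypothetical helper (not part of Spark): syntax-only validation.
object SqlSyntaxCheck {

  // Returns None when the text parses as a SQL statement,
  // or Some(parser message) when the parser rejects it.
  // Only parsing happens here: tables and columns are NOT resolved
  // and nothing is executed.
  def syntaxError(spark: SparkSession, sqlText: String): Option[String] =
    Try(spark.sessionState.sqlParser.parsePlan(sqlText)) match {
      case Success(_)                 => None
      case Failure(e: ParseException) => Some(e.getMessage)
      case Failure(other)             => throw other // unexpected, do not hide it
    }
}
```

With this sketch, SqlSyntaxCheck.syntaxError(spark, "hello world") should come back as Some(...) carrying the parser's message, while a well-formed SELECT over a table that does not exist still returns None, because name resolution is not part of parsing.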
SparkSession.sql Method
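That alone won't give you the most bullet-proof shield against incorrect SQL expressions, and I think the sql method is a better fit. parseExpression parses a single expression rather than a full statement, so it accepts input that is not a valid query.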
sql(sqlText: String): DataFrame
Executes a SQL query using Spark, returning the result as a DataFrame. The dialect that is used for SQL parsing can be configured with 'spark.sql.dialect'.
See both in action below.
scala> parser.parseExpression("hello world")
res5: org.apache.spark.sql.catalyst.expressions.Expression = 'hello AS world#2
scala> spark.sql("hello world")
org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input 'hello' expecting {'(', 'SELECT', 'FROM', 'ADD', 'DESC', 'WITH', 'VALUES', 'CREATE', 'TABLE', 'INSERT', 'DELETE', 'DESCRIBE', 'EXPLAIN', 'SHOW', 'USE', 'DROP', 'ALTER', 'MAP', 'SET', 'RESET', 'START', 'COMMIT', 'ROLLBACK', 'REDUCE', 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'DFS', 'TRUNCATE', 'ANALYZE', 'LIST', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'EXPORT', 'IMPORT', 'LOAD'}(line 1, pos 0)
== SQL ==
hello world
^^^
at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:217)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:114)
at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:68)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:638)
... 49 elided
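The two outcomes above can be folded into one check for the UI. The sketch below wraps spark.sql and separates syntax errors from analysis errors such as an unknown table or column. SqlValidation, Result, Valid, SyntaxError and AnalysisError are hypothetical names introduced for illustration. One assumption to keep in mind: for a plain SELECT, spark.sql only parses and analyzes the query and builds a lazy DataFrame, but commands such as CREATE or DROP are executed eagerly, so this is only suitable when the input is restricted to queries.

```scala
import org.apache.spark.sql.{AnalysisException, SparkSession}
import org.apache.spark.sql.catalyst.parser.ParseException

// Hypothetical helper (not part of Spark): classify a query without collecting results.
object SqlValidation {
  sealed trait Result
  case object Valid extends Result
  final case class SyntaxError(message: String) extends Result
  final case class AnalysisError(message: String) extends Result

  def validate(spark: SparkSession, sqlText: String): Result =
    try {
      // Parses and analyzes the query; no action is triggered for a SELECT.
      // Caution: DDL and other commands would run eagerly here.
      spark.sql(sqlText)
      Valid
    } catch {
      // ParseException is a subclass of AnalysisException, so match it first.
      case e: ParseException    => SyntaxError(e.getMessage)
      case e: AnalysisException => AnalysisError(e.getMessage)
    }
}
```

Here validate(spark, "hello world") should yield a SyntaxError, and a SELECT against a table that is not registered in the catalog should yield an AnalysisError, which a parser-only check cannot detect.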