I am using the Cloudera JDBC Driver for Impala v2.5.38 with Spark 1.6.0 to create a DataFrame. It works for all queries except those that use a WITH clause, and WITH is used extensively in my organization. My code snippet is below.
import org.apache.spark.sql.DataFrame

// Assumes a SQLContext named sqlContext is already in scope (e.g. in spark-shell).
def jdbcHDFS(url: String, sql: String): DataFrame = {
  val jdbcURL = s"jdbc:impala://$url"
  val connectionProperties = new java.util.Properties
  connectionProperties.setProperty("driver", "com.cloudera.impala.jdbc41.Driver")
  // The query is passed as an aliased subquery in place of a table name.
  sqlContext.read.jdbc(jdbcURL, s"($sql) AS ST", connectionProperties)
}
Below are examples of a working and a non-working SQL query:
val workingSQL = "select empname from (select * from employee) as tmp"
val nonWorkingSQL = "WITH tmp as (select * from employee) select empname from tmp"
Below is the output of rddDF.first for each of the queries above.
For workingSQL
scala> rddDF.first
res8: org.apache.spark.sql.Row = [Kushal]
For nonWorkingSQL
scala> rddDF.first
res8: org.apache.spark.sql.Row = [empname] // Here we expect the actual data, i.e. 'Kushal', as in the previous query, not the column name.
It would be really helpful if anyone could suggest a solution.
Please note: both queries work fine in impala-shell as well as in Hive through Hue.
Update: I set up a plain JDBC connection and executed nonWorkingSQL, and it worked. I then suspected that the issue was caused by Spark wrapping the query in "SELECT * FROM ( )", so I tried the SQL below to find the root cause, but it also worked and returned the expected result.
String sql = "SELECT * FROM (WITH tmp as (select * from employee) select empname from tmp) AS ST"
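For anyone who wants to repeat the plain JDBC check, here is a minimal sketch of it in Scala. The object name PlainJdbcCheck and the helpers wrapAsSubquery and firstValue are hypothetical names made up for illustration; the sketch assumes the same driver class as in the snippet above is on the classpath and that the JDBC URL needs no extra credentials.

```scala
import java.sql.DriverManager

object PlainJdbcCheck {
  // Hypothetical helper: wraps a query as an aliased subquery,
  // producing the same statement shape as the test SQL above.
  def wrapAsSubquery(sql: String, alias: String = "ST"): String =
    s"SELECT * FROM ($sql) AS $alias"

  // Hypothetical helper: runs `sql` over plain JDBC (no Spark involved) and
  // returns the first column of the first row, or None if there are no rows.
  // Assumes the Cloudera Impala JDBC 4.1 driver jar is on the classpath.
  def firstValue(jdbcURL: String, sql: String): Option[String] = {
    Class.forName("com.cloudera.impala.jdbc41.Driver")
    val conn = DriverManager.getConnection(jdbcURL)
    try {
      val stmt = conn.createStatement()
      try {
        val rs = stmt.executeQuery(sql)
        if (rs.next()) Option(rs.getString(1)) else None
      } finally stmt.close()
    } finally conn.close()
  }
}
```

It could be called as PlainJdbcCheck.firstValue(s"jdbc:impala://$url", PlainJdbcCheck.wrapAsSubquery(nonWorkingSQL)), with url being the same host and port string passed to jdbcHDFS. Running both the raw and the wrapped query this way makes it possible to compare the driver's behaviour directly with what Spark returns.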
So the root cause is still unclear and needs to be analysed so that the query also works with Spark. Any further suggestions are welcome.