Trying to execute a Spark SQL query from a UDF

Published 2019-09-11 01:41

Question:

I am trying to write an inline function in the Spark framework using Scala that takes a String input, executes a SQL statement, and returns a String value:

    val testfunc: (String => String) = (arg1: String) => {
      val k = sqlContext.sql("""select c_code from r_c_tbl where x_nm = "something" """)
      k.head().getString(0)
    }

I am registering this Scala function as a UDF:

    val testFunc_test = udf(testFunc)

I have a DataFrame over a Hive table:

    val df = sqlContext.table("some_table")

Then I call the UDF in a withColumn and try to save the result in a new DataFrame.

    val new_df = df.withColumn("test", testFunc_test($"col1"))

But every time I try this, I get an error:

    16/08/10 21:17:08 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, 10.0.1.5): java.lang.NullPointerException
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:41)
    at org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086)
    at org.apache.spark.sql.DataFrame.foreach(DataFrame.scala:1434)

I am relatively new to Spark and Scala, and I am not sure why this code should not run. Any insights or a workaround would be highly appreciated.

Please note that I have not pasted the whole error stack. Please let me know if it is required.

Answer 1:

You can't use sqlContext in your UDF. UDFs must be serializable so they can be shipped to the executors, and the context (which can be thought of as a connection to the cluster) cannot be serialized and sent to a worker node; only the driver application (where the UDF is defined, but not executed) can use the sqlContext.

It looks like your use case (performing a select from table X for each record in table Y) would be better accomplished with a join.
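
For example, something along these lines (a minimal sketch reusing the table and column names from the question; the join key col1 === x_nm is an assumption about how the two tables relate, since the question's UDF filters on a literal):

    // Sketch: replace the per-row lookup with a join against r_c_tbl.
    // Assumes r_c_tbl maps x_nm -> c_code and that col1 in df holds the keys.
    val lookup = sqlContext.table("r_c_tbl").select("x_nm", "c_code")

    val new_df = df
      .join(lookup, df("col1") === lookup("x_nm"), "left_outer") // keep rows with no match
      .withColumnRenamed("c_code", "test")                       // same output column as the UDF version
      .drop("x_nm")

This runs entirely as a distributed operation on the executors, so nothing that holds a reference to the context needs to be serialized; and if r_c_tbl is small enough, Spark can broadcast it automatically to avoid a shuffle.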