I have a "StructType" column in a Spark DataFrame that has an array and a string as sub-fields. I'd like to modify the array and return a new column of the same type. Can I process it with a UDF? Or what are the alternatives?
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
val sub_schema = StructType(StructField("col1", ArrayType(IntegerType, false), true) :: StructField("col2", StringType, true) :: Nil)
val schema = StructType(StructField("subtable", sub_schema, true) :: Nil)
val data = Seq(Row(Row(Array(1, 2), "eb")), Row(Row(Array(3, 2, 1), "dsf")))
val rd = sc.parallelize(data)
val df = spark.createDataFrame(rd, schema)
df.printSchema
root
|-- subtable: struct (nullable = true)
| |-- col1: array (nullable = true)
| | |-- element: integer (containsNull = false)
| |-- col2: string (nullable = true)
It seems that I need a UDF that operates on a Row, something like
val u = udf((x:Row) => x)
>> Schema for type org.apache.spark.sql.Row is not supported
This makes sense, since Spark does not know the schema for the return type. Unfortunately, udf.register fails too:
spark.udf.register("foo", (x:Row)=> Row, sub_schema)
<console>:30: error: overloaded method value register with alternatives: ...
You are on the right track. In this scenario a UDF will make your life easy. As you have already encountered, a UDF cannot return types that Spark does not know about, so basically you need to return something Spark can easily serialize. That can be a case class, or you can return a tuple like (Seq[Int], String). So here is a modified version of your code. Please also take a look at these links; they may give you more insight.
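A sketch of that modified version, assuming the case-class approach (the `Subtable` name and the `map(_ + 1)` transformation are illustrative, not from the original answer):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

// Hypothetical case class mirroring sub_schema; Spark can derive the
// return schema for a case class automatically, unlike for a raw Row.
case class Subtable(col1: Seq[Int], col2: String)

// The struct arrives in the UDF as a Row; rebuild it as a case class
// after transforming the array field.
val u = udf((r: Row) => Subtable(r.getSeq[Int](0).map(_ + 1), r.getString(1)))

val result = df.withColumn("subtable", u(df("subtable")))
```

Because `Subtable` has the same field names and types as `sub_schema`, the resulting column keeps the original struct shape.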
Yes, you can do this with a UDF. For simplicity I took your example with case classes, and I changed the array by adding 2 to every value.
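A sketch of what that code likely looked like (the `Subtable` case class and the `addTwo` name are assumptions for illustration):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

// Case class matching the struct's schema (col1: array<int>, col2: string).
case class Subtable(col1: Seq[Int], col2: String)

// Add 2 to every element of col1 and keep col2 unchanged.
val addTwo = udf((r: Row) => Subtable(r.getSeq[Int](0).map(_ + 2), r.getString(1)))

df.withColumn("subtable", addTwo(df("subtable"))).show(false)
```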
Running show on the result will print the struct column with each array element increased by 2.
It turns out you can pass the result schema as a second UDF parameter:
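For example, using the `udf(f, dataType)` overload available in Spark 2.x (in Spark 3 this untyped variant is disallowed by default in favor of typed UDFs); the `map(_ + 1)` transformation is illustrative:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

// Supplying sub_schema as the second argument tells Spark the return
// type explicitly, so the function may return a plain Row.
val u = udf((r: Row) => Row(r.getSeq[Int](0).map(_ + 1), r.getString(1)), sub_schema)

val result = df.withColumn("subtable", u(df("subtable")))
```

This avoids defining a case class at all, at the cost of losing compile-time checking of the returned fields against the schema.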