I am preparing a DataFrame with an id and a vector of my features, to be used later for making predictions. I do a groupBy on my DataFrame, and in the groupBy I merge a couple of columns as lists into a new column:
```scala
def mergeFunction(...) // with 14 input variables

val myudffunction = udf(mergeFunction _) // Spark doesn't support this (14 arguments)

df.groupBy("id").agg(
  collect_list(df(...)) as ...
  ... // too many of these (something like 14 of them)
).withColumn("feature_labels",
  myudffunction(
    col(...),
    col(...)))
  .select("id", "feature_labels")
```
This is how I am creating my feature vectors and their labels. It has been working for me so far, but this is the first time that my feature vector has grown past 10 elements, which is the maximum number of arguments a udf function in Spark accepts.

I am not sure how else I can fix this. Is the maximum number of udf inputs in Spark going to grow, have I understood them incorrectly, or is there a better way?
User defined functions are defined for up to 22 parameters. Only the `udf` helper is defined for at most 10 arguments. To handle functions with a larger number of parameters you can use `org.apache.spark.sql.UDFRegistration`.
For example, a function like the following sketch (a placeholder with 22 `Int` parameters that ignores its arguments and returns 1):
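```scala
val dummy = ((
  x0: Int, x1: Int, x2: Int, x3: Int, x4: Int, x5: Int, x6: Int, x7: Int,
  x8: Int, x9: Int, x10: Int, x11: Int, x12: Int, x13: Int, x14: Int,
  x15: Int, x16: Int, x17: Int, x18: Int, x19: Int, x20: Int, x21: Int) => 1)
```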
can be registered:
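```scala
// assumes a SparkSession named `spark`;
// register also returns the UserDefinedFunction for programmatic use
val dummyUdf = spark.udf.register("dummy", dummy)
```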
and then used directly (a sketch; `df` here is just a one-row frame for demonstration):
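```scala
import org.apache.spark.sql.functions.lit

val df = spark.range(1)
val exprs = (0 to 21).map(_ => lit(1)) // 22 literal arguments

df.select(dummyUdf(exprs: _*))
```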
or invoked by name via `callUDF`:
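```scala
import org.apache.spark.sql.functions.callUDF

df.select(callUDF("dummy", exprs: _*).alias("dummy"))
```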
or through a SQL expression:
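```scala
// 22 comma-separated literal arguments in a SQL call
df.selectExpr(s"dummy(${Seq.fill(22)(1).mkString(",")})")
```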
You can also create a `UserDefinedFunction` object directly.
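A minimal sketch, assuming Spark 2.x, where `UserDefinedFunction` is a concrete case class:

```scala
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.types.IntegerType

// dataType matches the function's Int return type; no input types are declared
val dummyFn = UserDefinedFunction(dummy, IntegerType, None)

df.select(dummyFn(exprs: _*))
```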
In practice, having a function with 22 arguments is not very useful, and unless you want to use Scala reflection to generate these, it is a maintenance nightmare.
I would consider either using collections (`array`, `map`) or a `struct` as an input, or dividing this into multiple modules.
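For example, a minimal sketch in which a single `array` column replaces many scalar arguments (`sumFeatures` is a made-up stand-in for real merge logic):

```scala
import org.apache.spark.sql.functions.{array, udf}

// one Seq[Int] parameter instead of 22 separate Int parameters
val sumFeatures = udf((xs: Seq[Int]) => xs.sum)

df.select(sumFeatures(array(exprs: _*)))
```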
Just to expand on zero's answer, it is possible to get the `.withColumn()` function to work with a UDF that has more than 10 parameters. You just need to `spark.udf.register()` the function and then use an `expr` for the argument when adding the column (instead of a `udf`).

For example, something like this should work:
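A sketch of the pattern using the question's 14-argument `mergeFunction`; the column names `f1` through `f14` are placeholders for the real collected columns:

```scala
import org.apache.spark.sql.functions.{collect_list, expr}

// make mergeFunction visible to the SQL expression parser
spark.udf.register("mergeFunction", mergeFunction _)

df.groupBy("id").agg(
    collect_list(df("f1")) as "f1",
    collect_list(df("f2")) as "f2"
    // ... one collect_list per column, up to f14
  )
  .withColumn("feature_labels",
    expr("mergeFunction(f1, f2, f3, f4, f5, f6, f7, f8, f9, f10, f11, f12, f13, f14)"))
  .select("id", "feature_labels")
```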
The underlying expression parser seems to handle more than 10 parameters, so I don't think you have to resort to passing around arrays to call the function. Also, if the parameters happen to have different data types, arrays would not work very well.