I have a Spark DataFrame
df
with five columns. I want to add another column with its values being the tuple of the first and second columns. When using with withColumn() method, I get the mismatch error, because the input is not Column type, but instead (Column,Column). I wonder if there is a solution beside running for loop over the rows in this case?
var dfCol=(col1:Column,col2:Column)=>(col1,col2)
val vv = df.withColumn( "NewColumn", dfCol( df(df.schema.fieldNames(1)) , df(df.schema.fieldNames(2)) ) )
You can use a User-defined function udf
to achieve what you want.
UDF definition
object TupleUDFs {
import org.apache.spark.sql.functions.udf
// type tag is required, as we have a generic udf
import scala.reflect.runtime.universe.{TypeTag, typeTag}
def toTuple2[S: TypeTag, T: TypeTag] =
udf[(S, T), S, T]((x: S, y: T) => (x, y))
}
Usage
df.withColumn(
"tuple_col", TupleUDFs.toTuple2[Int, Int].apply(df("a"), df("b"))
)
assuming "a" and "b" are the columns of type Int
you want to put in a tuple.
You can use struct
function which creates a tuple of provided columns:
import org.apache.spark.sql.functions.struct
val df = Seq((1,2), (3,4), (5,3)).toDF("a", "b")
df.withColumn("NewColumn", struct(df("a"), df("b")).show(false)
+---+---+---------+
|a |b |NewColumn|
+---+---+---------+
|1 |2 |[1,2] |
|3 |4 |[3,4] |
|5 |3 |[5,3] |
+---+---+---------+
You can merge multiple dataframe columns into one using array.
// $"*" will capture all existing columns
df.select($"*", array($"col1", $"col2").as("newCol"))