Spark Scala: How to update each column of a DataFrame

Published 2019-06-01 08:13

Question:

I have a DF like this:

+--------------------+-----+--------------------+
|               col_0|col_1|               col_2|
+--------------------+-----+--------------------+
|0.009069428120139292|  0.3|9.015488712438252E-6|
|0.008070826019024355|  0.4|3.379696051366339...|
|0.009774715414895803|  0.1|1.299590589291292...|
|0.009631155146285946|  0.9|1.218569739510422...|

And two Vectors:

v1[7.0,0.007,0.052]
v2[804.0,553.0,143993.0]

The number of columns in the DataFrame equals the number of positions in each vector. How can I apply an equation that uses the value stored at the ith position of a vector to update the value in the ith column of the DataFrame? In other words, I need to update every value in the DataFrame using the corresponding values in the vectors.

Answer 1:

Perhaps something like this is what you're after?

import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, lit}
import spark.implicits._ // needed for toDF

val df = Seq((1, 2, 3), (4, 5, 6)).toDF

val updateVector = Vector(10, 20, 30)

// The update rule: multiply each column by the matching vector element
val updateFunction = (columnValue: Column, vectorValue: Int) => columnValue * lit(vectorValue)

// Build one Column expression per DataFrame column, pairing each column
// with the vector element at the same index
val updateColumns = (df: DataFrame, updateVector: Vector[Int], updateFunction: (Column, Int) => Column) => {
    val columns = df.columns
    updateVector.zipWithIndex.map { case (updateValue, index) =>
        updateFunction(col(columns(index)), updateValue).as(columns(index))
    }
}

val dfUpdated = df.select(updateColumns(df, updateVector, updateFunction): _*)

dfUpdated.show

+---+---+---+
| _1| _2| _3|
+---+---+---+
| 10| 40| 90|
| 40|100|180|
+---+---+---+
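
The same pattern carries over to the original question's data with its two vectors. The question never states the actual equation, so the combining rule below (`value * v1(i) / v2(i)`) is purely a placeholder assumption to illustrate using both vectors at once; substitute your real formula in `combine`. A minimal sketch, assuming `df` is the questioner's three-column DataFrame of doubles:

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, lit}

val v1 = Vector(7.0, 0.007, 0.052)
val v2 = Vector(804.0, 553.0, 143993.0)

// Hypothetical update rule; replace with the real equation
val combine = (c: Column, i: Int) => c * lit(v1(i)) / lit(v2(i))

// One expression per column, indexed so each column sees the
// ith element of both vectors
val dfUpdated = df.select(
  df.columns.zipWithIndex.map { case (name, i) =>
    combine(col(name), i).as(name)
  }: _*
)
```

Because everything is built from `Column` expressions in a single `select`, Spark evaluates the whole update in one pass rather than one `withColumn` call per column.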