I have some string values in one column and I would like to replace the substrings in that column with values in other columns, and replace all the plus signs with spaces (like below).
I have these List[String]
mappings which are passed in dynamically where the mapFrom
and mapTo
should correlate in index.
Description values: mapFrom: ["Child Name", "Child Age", "Child State"]
Column names: mapTo: ["name", "age", "state"]
Input example:
name, age, state, description
tiffany, 10, virginia, Child Name + Child Age + Child State
andrew, 11, california, Child State + Child Name + Child Age
tyler, 12, ohio, Child Age + Child State + Child Name
Expected result:
name, age, state, description
tiffany, 10, virginia, tiffany 10 virginia
andrew, 11, california, california andrew 11
tyler, 12, ohio, 12 ohio tyler
How can I achieve this using Spark Scala?
You want to use regexp_replace
to replace substrings with a value from another column.
First, zip the two lists (here I add the change from +
to space into the two lsits but it can be done separately):
val mapFrom = List("Child Name", "Child Age", "Child State") :+ " \\+ "
val mapTo = List("name", "age", "state").map(col) :+ lit(" ")
val mapToFrom = mapFrom.zip(mapTo)
Assuming an input dataframe df
, replace all substrings with their respective value as follows:
val df2 = mapToFrom.foldLeft(df){case (df, (from, to)) =>
df.withColumn("description", regexp_replace($"description", lit(from), to))
}
With the provided input data, the result is as expected:
+-------+---+----------+--------------------+
|name |age|state |description |
+-------+---+----------+--------------------+
|tiffany|10 |virginia |tiffany 10 virginia |
|andrew |11 |california|california andrew 11|
|tyler |12 |ohio |12 ohio tyler |
+-------+---+----------+--------------------+