How to replace string values in one column with ac

2019-08-08 02:21发布

问题:

I have some string values in one column and I would like to replace the substrings in that column with values in other columns, and replace all the plus signs with spaces (like below).

I have these List[String] mappings which are passed in dynamically where the mapFrom and mapTo should correlate in index.

Description values: mapFrom: ["Child Name", "Child Age", "Child State"]

Column names: mapTo: ["name", "age", "state"]

Input example:

name, age, state, description
tiffany, 10, virginia, Child Name + Child Age + Child State
andrew, 11, california, Child State + Child Name + Child Age
tyler, 12, ohio, Child Age + Child State + Child Name

Expected result:

name, age, state, description
tiffany, 10, virginia, tiffany 10 virginia
andrew, 11, california, california andrew 11
tyler, 12, ohio, 12 ohio tyler

How can I achieve this using Spark Scala?

回答1:

You want to use regexp_replace to replace substrings with a value from another column.

First, zip the two lists (here I add the change from + to space into the two lsits but it can be done separately):

val mapFrom = List("Child Name", "Child Age", "Child State") :+ " \\+ "
val mapTo = List("name", "age", "state").map(col) :+ lit(" ")
val mapToFrom = mapFrom.zip(mapTo)

Assuming an input dataframe df, replace all substrings with their respective value as follows:

val df2 = mapToFrom.foldLeft(df){case (df, (from, to)) => 
  df.withColumn("description", regexp_replace($"description", lit(from), to))
}

With the provided input data, the result is as expected:

+-------+---+----------+--------------------+
|name   |age|state     |description         |
+-------+---+----------+--------------------+
|tiffany|10 |virginia  |tiffany 10 virginia |
|andrew |11 |california|california andrew 11|
|tyler  |12 |ohio      |12 ohio tyler       |
+-------+---+----------+--------------------+