How to replace string values in one column with ac

2019-08-16 01:10发布

问题:

I have some string values in one column and I would like to replace the substrings in that column with values in other columns, and replace all the plus signs with spaces (like below).

I have these List[String] mappings which are passed in dynamically where the mapFrom and mapTo should correlate in index.

Description values: mapFrom: ["Child", "ChildAge", "ChildState"]

Column names: mapTo: ["name", "age", "state"]

Input example:

name, age, state, description
tiffany, 10, virginia, Child + ChildAge + ChildState
andrew, 11, california, ChildState + Child + ChildAge
tyler, 12, ohio, ChildAge + ChildState + Child

Expected result:

name, age, state, description
tiffany, 10, virginia, tiffany 10 virginia
andrew, 11, california, california andrew 11
tyler, 12, ohio, 12 ohio tyler

How can I achieve this using Spark Scala?

When I try the solution from here: How to replace string values in one column with actual column values from other columns in the same dataframe?

The output becomes

name, age, state, description
tiffany, 10, virginia, tiffany tiffanyAge tiffanyState
andrew, 11, california, andrewState andrew andrewAge
tyler, 12, ohio, tylerAge tylerState tyler

回答1:

I would use map instead of built-in Spark functions.
Not the cleanest, but the working solution

val data = Seq(
  ("tiffany", 10, "virginia", "ChildName + ChildAge + ChildState"),
  ("andrew", 11, "california", "ChildState + ChildName + ChildAge"),
  ("tyler", 12, "ohio", "ChildAge + ChildState + ChildName")
).toDF("name", "age", "state", "description")

Define the schema for encoder conversions

val schema = StructType(Seq(
  StructField("name", StringType),
  StructField("age", IntegerType),
  StructField("state", StringType),
  StructField("description", StringType)
))
val encoder = RowEncoder(schema)

The logic itself

val res = data.map(row => {
  val desc = row.getAs[String]("description").replaceAll("\\s+", "").split("\\+")
  val sb = new StringBuilder()
  val map = desc.zipWithIndex.toMap.map(_.swap)

  map(0) match {
    case "ChildState" => sb.append(row.getAs[String]("state")).append(" ")
    case "ChildAge" => sb.append(row.getAs[Int]("age")).append(" ")
    case "ChildName" => sb.append(row.getAs[String]("name")).append(" ")
  }

  map(1) match {
    case "ChildState" => sb.append(row.getAs[String]("state")).append(" ")
    case "ChildAge" => sb.append(row.getAs[Int]("age")).append(" ")
    case "ChildName" => sb.append(row.getAs[String]("name")).append(" ")
  }

  map(2) match {
    case "ChildState" => sb.append(row.getAs[String]("state")).append(" ")
    case "ChildAge" => sb.append(row.getAs[Int]("age")).append(" ")
    case "ChildName" => sb.append(row.getAs[String]("name")).append(" ")
  }

  Row(row.getAs[String]("name"), row.getAs[Int]("age"), row.getAs[String]("state"), sb.toString())
}) (encoder)

Results

res.show(false)
+-------+---+----------+---------------------+
|name   |age|state     |description          | 
+-------+---+----------+---------------------+
|tiffany|10 |virginia  |tiffany 10 virginia  |
|andrew |11 |california|california andrew 11 |
|tyler  |12 |ohio      |12 ohio tyler        |
+-------+---+----------+---------------------+


回答2:

The problem here is due to the description containing Child. This is a subsequence of ChildAge and ChildState. Since a regex is used this means that the Child part will be replaced by the names resulting in strange outputs such as tiffanyAge and tiffanyState (note that the Child part here is replaced by the name).

There are two simple solutions in this case without changing the input:

  1. Change the regex for Child to use lookahead:

    val mapFrom = List("Child(?= )", "ChildAge", "ChildState") :+ " \\+ "
    

    This will only match Child when there is a space afterwards.

  2. Put Child last in the list. This means that ChildAge and ChildState will be matched first:

    val mapFrom = List("ChildAge", "ChildState", "Child") :+ " \\+ "
    

Full solution with the first alternative:

val mapFrom = List("Child(?= )", "ChildAge", "ChildState") :+ " \\+ "
val mapTo = List("name", "age", "state").map(col) :+ lit(" ")
val mapToFrom = mapFrom.zip(mapTo)

val df2 = mapToFrom.foldLeft(df){case (df, (from, to)) => 
  df.withColumn("description", regexp_replace($"description", lit(from), to))
}