I have some string values in one column and I would like to replace the substrings in that column with values in other columns, and replace all the plus signs with spaces (like below).
I have these List[String]
mappings which are passed in dynamically where the mapFrom
and mapTo
should correlate in index.
Description values: mapFrom: ["Child", "ChildAge", "ChildState"]
Column names: mapTo: ["name", "age", "state"]
Input example:
name, age, state, description
tiffany, 10, virginia, Child + ChildAge + ChildState
andrew, 11, california, ChildState + Child + ChildAge
tyler, 12, ohio, ChildAge + ChildState + Child
Expected result:
name, age, state, description
tiffany, 10, virginia, tiffany 10 virginia
andrew, 11, california, california andrew 11
tyler, 12, ohio, 12 ohio tyler
How can I achieve this using Spark Scala?
When I try the solution from here: How to replace string values in one column with actual column values from other columns in the same dataframe?
The output becomes
name, age, state, description
tiffany, 10, virginia, tiffany tiffanyAge tiffanyState
andrew, 11, california, andrewState andrew andrewAge
tyler, 12, ohio, tylerAge tylerState tyler
I would use map
instead of built-in Spark functions.
Not the cleanest, but the working solution
val data = Seq(
("tiffany", 10, "virginia", "ChildName + ChildAge + ChildState"),
("andrew", 11, "california", "ChildState + ChildName + ChildAge"),
("tyler", 12, "ohio", "ChildAge + ChildState + ChildName")
).toDF("name", "age", "state", "description")
Define the schema for encoder conversions
val schema = StructType(Seq(
StructField("name", StringType),
StructField("age", IntegerType),
StructField("state", StringType),
StructField("description", StringType)
))
val encoder = RowEncoder(schema)
The logic itself
val res = data.map(row => {
val desc = row.getAs[String]("description").replaceAll("\\s+", "").split("\\+")
val sb = new StringBuilder()
val map = desc.zipWithIndex.toMap.map(_.swap)
map(0) match {
case "ChildState" => sb.append(row.getAs[String]("state")).append(" ")
case "ChildAge" => sb.append(row.getAs[Int]("age")).append(" ")
case "ChildName" => sb.append(row.getAs[String]("name")).append(" ")
}
map(1) match {
case "ChildState" => sb.append(row.getAs[String]("state")).append(" ")
case "ChildAge" => sb.append(row.getAs[Int]("age")).append(" ")
case "ChildName" => sb.append(row.getAs[String]("name")).append(" ")
}
map(2) match {
case "ChildState" => sb.append(row.getAs[String]("state")).append(" ")
case "ChildAge" => sb.append(row.getAs[Int]("age")).append(" ")
case "ChildName" => sb.append(row.getAs[String]("name")).append(" ")
}
Row(row.getAs[String]("name"), row.getAs[Int]("age"), row.getAs[String]("state"), sb.toString())
}) (encoder)
Results
res.show(false)
+-------+---+----------+---------------------+
|name |age|state |description |
+-------+---+----------+---------------------+
|tiffany|10 |virginia |tiffany 10 virginia |
|andrew |11 |california|california andrew 11 |
|tyler |12 |ohio |12 ohio tyler |
+-------+---+----------+---------------------+
The problem here is due to the description containing Child
. This is a subsequence of ChildAge
and ChildState
. Since a regex is used this means that the Child
part will be replaced by the names resulting in strange outputs such as tiffanyAge
and tiffanyState
(note that the Child
part here is replaced by the name).
There are two simple solutions in this case without changing the input:
Change the regex for Child
to use lookahead:
val mapFrom = List("Child(?= )", "ChildAge", "ChildState") :+ " \\+ "
This will only match Child
when there is a space afterwards.
Put Child
last in the list. This means that ChildAge
and ChildState
will be matched first:
val mapFrom = List("ChildAge", "ChildState", "Child") :+ " \\+ "
Full solution with the first alternative:
val mapFrom = List("Child(?= )", "ChildAge", "ChildState") :+ " \\+ "
val mapTo = List("name", "age", "state").map(col) :+ lit(" ")
val mapToFrom = mapFrom.zip(mapTo)
val df2 = mapToFrom.foldLeft(df){case (df, (from, to)) =>
df.withColumn("description", regexp_replace($"description", lit(from), to))
}