I was recently trying to answer a question when I realised I didn't know how to use a back-reference in a regexp with Spark DataFrames.
For instance, with sed, I could do:
> echo 'a1
b22
333' | sed "s/\([0-9][0-9]*\)/;\1/"
a;1
b;22
;333
But with Spark DataFrames I can't:
val df = List("a1","b22","333").toDF("str")
df.show
+---+
|str|
+---+
| a1|
|b22|
|333|
+---+
val res = df.withColumn("repBackRef", regexp_replace('str, "(\\d+)$", ";\\1"))
res.show
+---+----------+
|str|repBackRef|
+---+----------+
| a1|       a;1|
|b22|       b;1|
|333|        ;1|
+---+----------+
Just to make it clear: I don't want the result for this particular case only; I would like a solution as generic as back-references in, for instance, sed.
Note also that regexp_extract falls short, since it behaves badly when there is no match:
val res2 = df
  .withColumn("repExtract", regexp_extract('str, "^([A-z])+?(\\d+)$", 2))
res2.show
So you are forced to use one column per pattern to extract, as I did in the said answer.
Thanks!
You need to use the $ + numeric_ID back-reference syntax:
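As a minimal sketch of that syntax, here is the question's own example with the replacement changed from ";\\1" to ";$1". Spark's regexp_replace follows Java regex rules, where $1 in the replacement string refers to the first capture group, while \1 is just an escaped literal 1. The session setup (local master, app name) is an assumption added to make the snippet self-contained; in spark-shell the session and implicits already exist and those lines can be dropped.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.regexp_replace

// Assumed local session for illustration; spark-shell already provides `spark`.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("regexp-backref-sketch")
  .getOrCreate()
import spark.implicits._

val df = List("a1", "b22", "333").toDF("str")

// "$1" in the replacement refers to capture group 1 (the trailing digits).
// This is a plain string, not an s"..." interpolation, so the dollar sign needs no escaping.
val res = df.withColumn("repBackRef", regexp_replace($"str", "(\\d+)$", ";$1"))

res.show()
// repBackRef values: a;1, b;22, ;333
```

Further groups follow the same pattern ($2, $3, and so on), which gives the same generality as \1, \2 in sed.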