Back-reference in Spark DataFrame `regexp_replace`

Asked 2019-07-29 17:11

I was recently trying to answer a question when I realised I didn't know how to use a back-reference in a regexp replacement with Spark DataFrames.

For instance, with sed, I could do

> echo 'a1
b22
333' | sed "s/\([0-9][0-9]*\)/;\1/"

a;1
b;22
;333

But with Spark DataFrames I can't:

val df = List("a1","b22","333").toDF("str")
df.show

+---+
|str|
+---+
| a1|
|b22|
|333|
+---+

val res = df
  .withColumn("repBackRef",regexp_replace('str,"(\\d+)$",";\\1"))
res.show

+---+----------+
|str|repBackRef|
+---+----------+
| a1|       a;1|
|b22|       b;1|
|333|        ;1|
+---+----------+

Just to be clear: I am not after the result for this particular case; I would like a solution that is as generic as a back-reference in, for instance, sed.

Note also that regexp_extract falls short, since it misbehaves when there is no match:

val res2 = df
  .withColumn("repExtract",regexp_extract('str,"^([A-z])+?(\\d+)$",2))
res2.show

As a result, you are forced to use one column per pattern you want to extract, as I did in that answer (see the rough sketch below).
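For reference, a rough sketch of what that per-group workaround ends up looking like (the column names, the simplified regex and the concat step are my own illustration, not the exact code from that answer):

import org.apache.spark.sql.functions.{regexp_extract, concat, lit}

// One regexp_extract call (and one column) per capture group,
// then the pieces have to be stitched back together by hand.
val pat = "^([A-Za-z]*)(\\d+)$"
val res3 = df
  .withColumn("letters", regexp_extract('str, pat, 1))
  .withColumn("digits",  regexp_extract('str, pat, 2))
  .withColumn("rebuilt", concat('letters, lit(";"), 'digits))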

Thanks!

1 Answer

手持菜刀,她持情操 · 2019-07-29 17:32

You need to use the $ + group-number back-reference syntax ($1 instead of \1), since the replacement string follows Java regex replacement conventions:

.withColumn("repBackRef",regexp_replace('str,"(\\d+)$",";$1"))
                                                         ^^
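Putting it together with the question's DataFrame (assuming a spark-shell session with org.apache.spark.sql.functions._ in scope; the output shown in the comments is what I would expect from Java's replacement-string rules):

// Java's regex replacement strings use $1, $2, ... for capture groups;
// a backslash escapes the next character, so ";\1" was just a literal ";1".
val res = df.withColumn("repBackRef", regexp_replace('str, "(\\d+)$", ";$1"))
res.show

// +---+----------+
// |str|repBackRef|
// +---+----------+
// | a1|       a;1|
// |b22|      b;22|
// |333|      ;333|
// +---+----------+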