How can I remove all \ characters that are part of string values in multiple columns of a Spark DataFrame?
Sample row:
11~ADX\|0.00\|ZZ\|BP\|WHT~SE\|41\|0064~GE\|0001\
Expected output:
11~ADX|0.00|ZZ|BP|WHT~SE|41|0064~GE|0001
Use foldLeft over all columns in the dataframe; that way you can apply regexp_replace to each column in turn and return the final dataframe. Using the example dataframe in the question (called df below), to remove all backslashes:
val df2 = df.columns.foldLeft(df)((df, c) => df.withColumn(c, regexp_replace(col(c), "\\\\", "")))
If you instead want to escape each backslash by doubling it, use:
val df2 = df.columns.foldLeft(df)((df, c) => df.withColumn(c, regexp_replace(col(c), "\\\\", "\\\\\\\\")))
If not all columns should be used, create a separate variable containing the columns to use. To use all columns except one (called col below):
val cols = df.columns diff List("col")
cols.foldLeft ...
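The quadruple backslash in the snippets above can be puzzling: the pattern string is unescaped once by the compiler and once more by the regex engine, so "\\\\" ends up matching a single literal backslash. Python's re applies the same two layers, so the effect can be checked with a quick stdlib-only sketch (the sample string is taken from the question; this is an illustration, not Spark itself):

```python
import re

s = "11~ADX\\|0.00\\|ZZ\\|BP\\|WHT~SE\\|41\\|0064~GE\\|0001"

# "\\\\" survives string unescaping as "\\", which the regex
# engine then reads as one literal backslash.
removed = re.sub("\\\\", "", s)        # strip every backslash
doubled = re.sub("\\\\", r"\\\\", s)   # escape (double) every backslash

print(removed)  # 11~ADX|0.00|ZZ|BP|WHT~SE|41|0064~GE|0001
```

The same "\\\\" pattern therefore works unchanged whether the regex is compiled by Scala/Java or by Python.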
I am trying to implement the same approach in PySpark, but could not find the equivalent of the code below:
val df2 = df.columns.foldLeft(df)((df, c) => df.withColumn(c, regexp_replace(col(c), "\\\\", "")))
I tried to iterate over the columns as follows, but it did not work:
from pyspark.sql.functions import regexp_replace

def cleanColumn(tmpdf, colName, findChar, replaceChar):
    return tmpdf.withColumn(colName, regexp_replace(colName, findChar, replaceChar))

def cleanDF(df):
    allColNames = df.schema.names
    # Patterns are Java regexes: "\\\\" is one literal backslash
    # after string and regex unescaping; "!" needs no escaping.
    charsToRemove = ["\\\\", "!", "\\r", "\\n"]
    replaceWith = " "
    for colName in allColNames:
        for charToRemove in charsToRemove:
            df = cleanColumn(df, colName, charToRemove, replaceWith)
    return df
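The direct counterpart of Scala's foldLeft in Python is functools.reduce; in PySpark proper that would look like reduce(lambda d, c: d.withColumn(c, regexp_replace(col(c), "\\\\", "")), df.columns, df). The fold shape itself can be sketched without a Spark cluster using a toy list-of-dicts "dataframe" (the rows and the with_column helper below are hypothetical stand-ins, not Spark API):

```python
import re
from functools import reduce

# Toy stand-in for a Spark DataFrame: a list of row dicts.
rows = [{"a": "11~ADX\\|0.00\\|ZZ\\|BP\\|WHT", "b": "41\\|0064"}]
columns = ["a", "b"]

def with_column(df, name, fn):
    # Return a new "dataframe" with fn applied to one column.
    return [{**row, name: fn(row[name])} for row in df]

# reduce is Python's foldLeft: thread the accumulated dataframe
# through one with_column step per column name.
cleaned = reduce(
    lambda df, c: with_column(df, c, lambda s: re.sub(r"\\", "", s)),
    columns,
    rows,
)
print(cleaned[0]["a"])  # 11~ADX|0.00|ZZ|BP|WHT
print(cleaned[0]["b"])  # 41|0064
```

Swapping with_column for DataFrame.withColumn and re.sub for regexp_replace gives the PySpark version; reduce's third argument plays the role of foldLeft's initial accumulator.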