I have spark dataframe with whitespaces in some of column names, which has to be replaced with underscore.
I know a single column can be renamed using withColumnRenamed()
in sparkSQL, but to rename 'n' number of columns, this function has to chained 'n' times (to my knowledge).
To automate this, i have tried:
val old_names = df.columns() // contains array of old column names
val new_names = old_names.map { x =>
if(x.contains(" ") == true)
x.replaceAll("\\s","_")
else x
} // array of new column names with removed whitespace.
Now, how to replace df's header with new_names
var newDf = df
for(col <- df.columns){
newDf = newDf.withColumnRenamed(col,col.replaceAll("\\s", "_"))
}
You can encapsulate it in some method so it won't be too much pollution.
In Python, this can be done by the following code:
# Importing sql types
from pyspark.sql.types import StringType, StructType, StructField
from pyspark.sql.functions import col
# Building a simple dataframe:
schema = StructType([
StructField("id name", StringType(), True),
StructField("cities venezuela", StringType(), True)
])
column1 = ['A', 'A', 'B', 'B', 'C', 'B']
column2 = ['Maracaibo', 'Valencia', 'Caracas', 'Barcelona', 'Barquisimeto', 'Merida']
# Dataframe:
df = sqlContext.createDataFrame(list(zip(column1, column2)), schema=schema)
df.show()
exprs = [col(column).alias(column.replace(' ', '_')) for column in df.columns]
df.select(*exprs).show()
As best practice, you should prefer expressions and immutability.
You should use val
and not var
as much as possible.
Thus, it's preferable to use the foldLeft
operator, in this case :
val newDf = df.columns
.foldLeft(df)((curr, n) => curr.withColumnRenamed(n, n.replaceAll("\\s", "_")))
You can do the exact same thing in python:
raw_data1 = raw_data
for col in raw_data.columns:
raw_data1 = raw_data1.withColumnRenamed(col,col.replace(" ", "_"))
In Scala, here is another way achieving same -
import org.apache.spark.sql.types._
val df_with_newColumns = spark.createDataFrame(df.rdd,
StructType(df.schema.map(s => StructField(s.name.replaceAll(" ", ""),
s.dataType, s.nullable))))
Hope this helps !!