I have a DataFrame with around 1000 (the number varies) columns.
I want to make all values upper case.
Here is the approach I have thought of; can you suggest if this is the best way?
- Take a row
- Find the schema, store it in an array, and count how many fields there are
- Map through each row in the DataFrame, up to the number of elements in the array
- Apply a function to upper-case each field and return the row (roughly the sketch below)
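Just as a sketch of that idea (assuming a DataFrame `df` and a `spark` session; non-string fields pass through untouched):

import org.apache.spark.sql.Row

// Row-wise sketch: map over the underlying RDD and upper-case
// every string field, reusing the original schema
val upperRdd = df.rdd.map { row =>
  Row.fromSeq(row.toSeq.map {
    case s: String => s.toUpperCase // upper-case string fields
    case other     => other         // leave non-string fields as-is
  })
}
val upperDf = spark.createDataFrame(upperRdd, df.schema)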
If you simply want to apply the same function to all columns, something like this should be enough:
import org.apache.spark.sql.functions.{col, upper}

// In spark-shell the needed implicits are already in scope
val df = sc.parallelize(
  Seq(("a", "B", "c"), ("D", "e", "F"))).toDF("x", "y", "z")
// Build an upper(...) expression for every column and select them all
df.select(df.columns.map(c => upper(col(c)).alias(c)): _*).show
// +---+---+---+
// | x| y| z|
// +---+---+---+
// | A| B| C|
// | D| E| F|
// +---+---+---+
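Note that `upper` coerces non-string inputs to strings, so with ~1000 mixed-type columns you may want to restrict it to the string columns. A variant along the same lines (a sketch against the same `df`) leaves the other types intact:

import org.apache.spark.sql.functions.{col, upper}
import org.apache.spark.sql.types.StringType

// Upper-case only the StringType columns; select the rest unchanged
val exprs = df.schema.fields.map { f =>
  if (f.dataType == StringType) upper(col(f.name)).alias(f.name)
  else col(f.name)
}
df.select(exprs: _*).show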
or in Python:
from pyspark.sql.functions import col, upper

df = sc.parallelize([("a", "B", "c"), ("D", "e", "F")]).toDF(("x", "y", "z"))
# Build an upper(...) expression for every column and select them all
df.select(*(upper(col(c)).alias(c) for c in df.columns)).show()
## +---+---+---+
## | x| y| z|
## +---+---+---+
## | A| B| C|
## | D| E| F|
## +---+---+---+
See also: SparkSQL: apply aggregate functions to a list of columns