Apply same function to all fields of spark dataframe

Published 2019-02-13 19:13

Question:

I have a dataframe with around 1000 columns (the exact number varies).

I want to make all values upper case.

Here is the approach I have thought of (a rough sketch follows the list); can you suggest whether this is the best way?

  • Take a row.
  • Find the schema, store it in an array, and count how many fields there are.
  • Map over each row in the dataframe, up to the number of elements in the array.
  • Apply a function to upper-case each field and return the row.
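
A minimal sketch of that row-based idea, assuming a SparkSession named spark and an existing DataFrame df; as the answer below shows, a column-wise select is simpler and avoids the row-level round trip:

import org.apache.spark.sql.Row

// Map over each row, upper-casing the string fields and reusing the schema.
val upperRdd = df.rdd.map { row =>
  Row.fromSeq(row.toSeq.map {
    case s: String => s.toUpperCase
    case other     => other // pass non-string fields through unchanged
  })
}
val upperDf = spark.createDataFrame(upperRdd, df.schema)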

Answer 1:

If you simply want to apply the same function to all columns, something like this should be enough:

import org.apache.spark.sql.functions.{col, upper}
import spark.implicits._ // required for toDF outside the spark-shell

val df = sc.parallelize(
  Seq(("a", "B", "c"), ("D", "e", "F"))).toDF("x", "y", "z")

// Upper-case every column while keeping the original column names.
df.select(df.columns.map(c => upper(col(c)).alias(c)): _*).show

// +---+---+---+
// |  x|  y|  z|
// +---+---+---+
// |  A|  B|  C|
// |  D|  E|  F|
// +---+---+---+
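
The same varargs select generalizes to any Column => Column function; here is a minimal sketch, where withAllColumns is a hypothetical helper name, not a Spark API:

import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, trim, upper}

// Apply an arbitrary Column => Column transformation to every column.
def withAllColumns(df: DataFrame)(f: Column => Column): DataFrame =
  df.select(df.columns.map(c => f(col(c)).alias(c)): _*)

val trimmedUpper = withAllColumns(df)(c => upper(trim(c)))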

or in Python

from pyspark.sql.functions import col, upper

df = sc.parallelize([("a", "B", "c"), ("D", "e", "F")]).toDF(("x", "y", "z"))

# Upper-case every column while keeping the original column names.
df.select(*(upper(col(c)).alias(c) for c in df.columns)).show()

##  +---+---+---+
##  |  x|  y|  z|
##  +---+---+---+
##  |  A|  B|  C|
##  |  D|  E|  F|
##  +---+---+---+

See also: SparkSQL: apply aggregate functions to a list of columns