Got an error when using DataFrame.schema.fields.upd

Published 2019-07-17 03:56

Question:

I want to cast two columns in my DataFrame. Here is my code:

val session = SparkSession
  .builder
  .master("local")
  .appName("UDTransform").getOrCreate()
var df: DataFrame = session.createDataFrame(Seq((1, "Spark", 111), (2, "Storm", 112), (3, "Hadoop", 113), (4, "Kafka", 114), (5, "Flume", 115), (6, "Hbase", 116)))
  .toDF("CID", "Name", "STD")
df.printSchema()
df.schema.fields.update(0, StructField("CID", StringType))
df.schema.fields.update(2, StructField("STD", StringType))
df.printSchema()
df.show()

I get these logs from my console:

root
 |-- CID: integer (nullable = false)
 |-- Name: string (nullable = true)
 |-- STD: integer (nullable = false)

root
 |-- CID: string (nullable = true)
 |-- Name: string (nullable = true)
 |-- STD: string (nullable = true)

17/06/28 12:44:32 ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 36, Column 31: A method named "toString" is not declared in any enclosing class nor any supertype, nor through a static import

All I want to know is why this error happens and how I can solve it. I would appreciate any help!

Answer 1:

You cannot update the schema of a DataFrame in place, because DataFrames are immutable. You can, however, cast the columns and assign the result to a new DataFrame.

Here is how you can do it:

import org.apache.spark.sql.functions.col

val newDF = df.withColumn("CID", col("CID").cast("string"))
  .withColumn("STD", col("STD").cast("string"))

newDF.printSchema()

The schema of newDF is

    root
     |-- CID: string (nullable = true)
     |-- Name: string (nullable = true)
     |-- STD: string (nullable = true)
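If more than two columns need the same cast, the approach above can be folded over a list of column names. The following is only a minimal sketch: the object name CastColumns and the helper castToString are hypothetical names chosen for illustration, and the sample rows are a shortened copy of the ones in the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StringType

object CastColumns {
  // Hypothetical helper: returns a NEW DataFrame in which every column
  // listed in `names` has been cast to string. The input is left untouched.
  def castToString(df: DataFrame, names: Seq[String]): DataFrame =
    names.foldLeft(df) { (acc, name) =>
      // withColumn with an existing name replaces that column in place
      acc.withColumn(name, col(name).cast(StringType))
    }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local")
      .appName("CastColumns")
      .getOrCreate()

    val df = spark
      .createDataFrame(Seq((1, "Spark", 111), (2, "Storm", 112)))
      .toDF("CID", "Name", "STD")

    val newDF = castToString(df, Seq("CID", "STD"))
    newDF.printSchema() // schema of the cast copy
    newDF.show()
    df.printSchema()    // schema of the original, which was never modified

    spark.stop()
  }
}
```

Because the helper returns a new DataFrame instead of touching df, the schema and the data stay consistent with each other.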

Your code:

df.schema.fields.update(0, StructField("CID", StringType))
df.schema.fields.update(2, StructField("STD", StringType))
df.printSchema()
df.show()

In your code, df.schema.fields returns an Array of StructField:

Array[StructField]

If you then try to update it like this:

df.schema.fields.update(0, StructField("CID", StringType))

This replaces the element at index 0 of that Array[StructField]. I think this is not what you wanted.

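DataFrame.schema.fields.update does not cast any data. It only mutates the array of StructField returned by DataFrame.schema.fields, so printSchema reports string columns while the underlying rows still hold integers. That mismatch is most likely why the generated code fails to compile when show() runs.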
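The array mutation can be reproduced on a bare StructType, without any DataFrame or SparkSession. This is a small sketch meant only to illustrate the mechanism; the object name SchemaMutationDemo and the method typeAfterUpdate are made up for this example, and it assumes the spark-sql types are on the classpath.

```scala
import org.apache.spark.sql.types.{DataType, IntegerType, StringType, StructField, StructType}

object SchemaMutationDemo {
  // Builds a schema, overwrites one slot of the array returned by `fields`,
  // and returns the data type the schema object reports for that slot afterwards.
  def typeAfterUpdate(): DataType = {
    val schema = StructType(Seq(
      StructField("CID", IntegerType, nullable = false),
      StructField("Name", StringType)
    ))

    // `fields` hands back the schema's own array, so this call changes the
    // schema description in place. No value is converted anywhere.
    schema.fields.update(0, StructField("CID", StringType))

    schema.fields(0).dataType
  }

  def main(args: Array[String]): Unit =
    println(typeAfterUpdate())
}
```

The update only rewrites a description. Nothing in it touches rows, which is why a cast that produces a new DataFrame is needed to change the actual column types.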

Hope this helps