可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

why is nullable = true after some functions are executed? Still there are no nan values in the df.

val myDf = Seq((2,"A"),(2,"B"),(1,"C"))
         .toDF("foo","bar")
         .withColumn("foo", 'foo.cast("Int"))

myDf.withColumn("foo_2", when($"foo" === 2 , 1).otherwise(0)).select("foo", "foo_2").show

when df.printSchema is called now nullable will be false for both columns.

val foo: (Int => String) = (t: Int) => {
    fooMap.get(t) match {
      case Some(tt) => tt
      case None => "notFound"
    }
  }

val fooMap = Map(
    1 -> "small",
    2 -> "big"
 )
val fooUDF = udf(foo)

myDf
    .withColumn("foo", fooUDF(col("foo")))
    .withColumn("foo_2", when($"foo" === 2 , 1).otherwise(0)).select("foo", "foo_2")
    .select("foo", "foo_2")
    .printSchema

However now, nullable is true for at least one column which was false before. How can this be explained?

回答1:

When creating Dataset from statically typed structure (without depending on schema argument) Spark uses a relatively simple set of rules to determine nullable property.

If object of the given type can be null then its DataFrame representation is nullable.
If object is an Option[_] then then its DataFrame representation is nullable with None considered to be SQL NULL.
In any other case it will be marked as not nullable.

Since Scala String is java.lang.String, which can be null, generated column can is nullable. For the same reason bar column is nullable in the initial dataset:

val data1 = Seq[(Int, String)]((2, "A"), (2, "B"), (1, "C"))
val df1 = data1.toDF("foo", "bar")
df1.schema("bar").nullable

Boolean = true

but foo is not (scala.Int cannot be null).

df1.schema("foo").nullable

Boolean = false

If we change data definition to:

val data2 = Seq[(Integer, String)]((2, "A"), (2, "B"), (1, "C"))

foo will be nullable (Integer is java.lang.Integer and boxed integer can be null):

data2.toDF("foo", "bar").schema("foo").nullable

Boolean = true

See also: SPARK-20668 Modify ScalaUDF to handle nullability.

回答2:

You could change schema of dataframe very quickly as well. something like this would do the job -

def setNullableStateForAllColumns( df: DataFrame, columnMap: Map[String, Boolean]) : DataFrame = {
    import org.apache.spark.sql.types.{StructField, StructType}
    // get schema
    val schema = df.schema
    val newSchema = StructType(schema.map {
    case StructField( c, d, n, m) =>
      StructField( c, d, columnMap.getOrElse(c, default = n), m)
    })
    // apply new schema
    df.sqlContext.createDataFrame( df.rdd, newSchema )
}