I want to change the nullable property of a particular column in a Spark DataFrame.
If I print the schema of the DataFrame, it currently looks like this:
col1: string (nullable = false)
col2: string (nullable = true)
col3: string (nullable = false)
col4: float (nullable = true)
I only want the nullable property of col3 to be updated:
col1: string (nullable = false)
col2: string (nullable = true)
col3: string (nullable = true)
col4: float (nullable = true)
I searched online and found the link below, but it seems to change the property for all columns rather than for one specific column.
Change nullable property of column in spark dataframe
Can anyone please help me with this?
There is no "clean" way to do this. You can use a workaround like the one in the linked answer.
Relevant code from that answer:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{StructField, StructType}

def setNullableStateOfColumn(df: DataFrame, cn: String, nullable: Boolean): DataFrame = {
  // get schema
  val schema = df.schema
  // modify the [[StructField]] with name `cn`
  val newSchema = StructType(schema.map {
    case StructField(c, t, _, m) if c == cn => StructField(c, t, nullable = nullable, m)
    case y: StructField => y
  })
  // apply new schema
  df.sqlContext.createDataFrame(df.rdd, newSchema)
}
This creates a new DataFrame from the same data with a copy of the schema, setting the nullable flag programmatically.
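The schema rewrite can be checked on its own, without a running Spark session. The following is a minimal sketch, assuming the schema from the question; the object name `NullableSketch` and the helper `withNullable` are hypothetical names introduced here for illustration, not part of the linked answer.

```scala
import org.apache.spark.sql.types.{FloatType, StringType, StructField, StructType}

object NullableSketch {
  // Hypothetical helper: copy `schema`, changing `nullable` only on the field called `name`.
  def withNullable(schema: StructType, name: String, nullable: Boolean): StructType =
    StructType(schema.map {
      case f if f.name == name => f.copy(nullable = nullable)
      case f                   => f
    })

  // Schema taken from the question (assumed to match the real DataFrame).
  val before: StructType = StructType(Seq(
    StructField("col1", StringType, nullable = false),
    StructField("col2", StringType, nullable = true),
    StructField("col3", StringType, nullable = false),
    StructField("col4", FloatType, nullable = true)
  ))

  // Only col3 changes; every other field is passed through untouched.
  val after: StructType = withNullable(before, "col3", nullable = true)
}
```

With a real DataFrame `df`, the same helper would presumably be applied as `df.sparkSession.createDataFrame(df.rdd, NullableSketch.withNullable(df.schema, "col3", nullable = true))`, followed by `printSchema()` to confirm the change.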
Version for many columns:
def setNullableStateOfColumn(df: DataFrame, nullValues: Map[String, Boolean]): DataFrame = {
  // get schema
  val schema = df.schema
  // modify every [[StructField]] whose name is a key in `nullValues`
  val newSchema = StructType(schema.map {
    // nullValues(c) yields the Boolean; nullValues.get(c) would yield an Option[Boolean] and not compile
    case StructField(c, t, _, m) if nullValues.contains(c) => StructField(c, t, nullable = nullValues(c), m)
    case y: StructField => y
  })
  // apply new schema
  df.sqlContext.createDataFrame(df.rdd, newSchema)
}
Usage:
setNullableStateOfColumn(df1, Map("col1" -> true, "col2" -> true, "col7" -> false))