Casting a column to a DecimalType in a DataFrame seems to change the nullable property. Specifically, I have a non-nullable column of type DecimalType(12, 4) and I'm casting it to DecimalType(38, 9) using df.withColumn(columnName, df.col(columnName).cast(dataType)). This results in a field with the expected data type, but the field is now nullable. Is there a way to cast without changing the nullable property of a column?
I observe this behavior in both Spark 2.2.1 and Spark 2.3.0.
Thanks for an interesting point. I dug a little into the source code to understand this behavior, and IMO the answer is in Cast.scala, which represents the cast expression. The property exposing nullability is computed like this:
override def nullable: Boolean = Cast.forceNullable(child.dataType, dataType) || child.nullable

def forceNullable(from: DataType, to: DataType): Boolean = (from, to) match {
  case (NullType, _) => true
  case (_, _) if from == to => false

  case (StringType, BinaryType) => false
  case (StringType, _) => true
  case (_, StringType) => false

  case (FloatType | DoubleType, TimestampType) => true
  case (TimestampType, DateType) => false
  case (_, DateType) => true
  case (DateType, TimestampType) => false
  case (DateType, _) => true
  case (_, CalendarIntervalType) => true

  case (_, _: DecimalType) => true // overflow
  case (_: FractionalType, _: IntegralType) => true // NaN, infinity
  case _ => false
}
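To see why a widening decimal cast still becomes nullable, here is a minimal, self-contained sketch (not Spark's actual classes) that mirrors the decision logic of a few of the cases above. Note that the DecimalType rule matches only on the target type, so even DecimalType(12, 4) -> DecimalType(38, 9) forces nullability:

```scala
object ForceNullableSketch {
  // Tiny stand-ins for Spark's DataType hierarchy, just enough for the demo.
  sealed trait DataType
  case object NullType extends DataType
  case object StringType extends DataType
  case object IntegerType extends DataType
  case object DoubleType extends DataType
  case class DecimalType(precision: Int, scale: Int) extends DataType

  // Mirrors the shape of Cast.forceNullable for the cases modeled above.
  def forceNullable(from: DataType, to: DataType): Boolean = (from, to) match {
    case (NullType, _)             => true
    case (f, t) if f == t          => false
    case (StringType, _)           => true
    case (_, StringType)           => false
    case (_, _: DecimalType)       => true  // overflow risk: matches on target only
    case (DoubleType, IntegerType) => true  // NaN, infinity
    case _                         => false
  }

  def main(args: Array[String]): Unit = {
    // Widening a decimal still forces nullability:
    println(forceNullable(DecimalType(12, 4), DecimalType(38, 9))) // true
    // Identical types do not:
    println(forceNullable(IntegerType, IntegerType))               // false
  }
}
```

The key observation is that the `case (_, _: DecimalType)` arm never inspects `from`, so Spark cannot distinguish a safe widening cast from a potentially overflowing one.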
As you can see, the conversion from any type to DecimalType always returns a nullable type. I was wondering why, and it's probably because of the risk of overflow, which is expressed here:
/**
 * Change the precision / scale in a given decimal to those set in `decimalType` (if any),
 * returning null if it overflows or modifying `value` in-place and returning it if successful.
 *
 * NOTE: this modifies `value` in-place, so don't call it on external data.
 */
private[this] def changePrecision(value: Decimal, decimalType: DecimalType): Decimal = {
  if (value.changePrecision(decimalType.precision, decimalType.scale)) value else null
}
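The overflow semantics can be sketched without Spark using java.math.BigDecimal. The following is a hypothetical stand-in for the check performed by Decimal.changePrecision (names and Option return are my own, not Spark's API): rescale the value, then verify it still fits in the target precision, returning None where Spark would produce null:

```scala
import java.math.{BigDecimal => JBigDecimal, RoundingMode}

object ChangePrecisionSketch {
  // Returns Some(rescaled value) if it fits in (precision, scale), else None.
  def changePrecision(value: JBigDecimal, precision: Int, scale: Int): Option[JBigDecimal] = {
    val rescaled = value.setScale(scale, RoundingMode.HALF_UP)
    if (rescaled.precision <= precision) Some(rescaled) else None
  }

  def main(args: Array[String]): Unit = {
    val v = new JBigDecimal("1234.5678")
    // Widening to (38, 9): fits comfortably.
    println(changePrecision(v, 38, 9)) // Some(1234.567800000)
    // Narrowing to (5, 2): 1234.57 needs 6 digits of precision, so it overflows.
    println(changePrecision(v, 5, 2))  // None
  }
}
```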
The changePrecision method in turn checks whether the precision can be modified, returning true if it can and false otherwise. This explains why the method above can return null, and hence why a cast to DecimalType is set to nullable by default, independently of the source type.
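Interestingly, for the specific cast in the question the overflow concern is vacuous: DecimalType(38, 9) leaves 29 integer digits, more than the 8 that DecimalType(12, 4) can hold, so no value can overflow. A quick check outside Spark (with java.math.BigDecimal, increasing the scale exactly) illustrates this, even though Spark's rule never performs such a check:

```scala
import java.math.{BigDecimal => JBigDecimal}

object WideningCheck {
  // Does `value` fit in a decimal of the given (precision, scale)?
  // Increasing the scale is always exact, so no rounding mode is needed.
  def fits(value: JBigDecimal, precision: Int, scale: Int): Boolean =
    value.setScale(scale).precision <= precision

  def main(args: Array[String]): Unit = {
    // The largest possible DecimalType(12, 4) value:
    val max = new JBigDecimal("99999999.9999")
    println(fits(max, 38, 9)) // true: this widening cast can never produce null
  }
}
```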
Because of that, IMO there is no simple way to keep the nullability of the original column. Maybe you could take a look at UserDefinedType and build your own DecimalType that keeps the source properties? But IMO the nullability is there for a reason, and we should respect it to avoid bad surprises sooner or later in the pipeline.