Spark: cast decimal without changing nullable prop

Published 2019-06-28 07:14

Question:

Casting a column to a DecimalType in a DataFrame seems to change the nullable property. Specifically, I have a non-nullable column of type DecimalType(12, 4) and I'm casting it to DecimalType(38, 9) using df.withColumn(columnName, df.col(columnName).cast(dataType)). This results in a field with the expected data type, but the field is now nullable. Is there a way to cast without changing the nullable property of a column?

I observe this behavior in both Spark 2.2.1 and Spark 2.3.0.
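A minimal sketch reproducing the behavior described above (the object name and the local SparkSession setup are mine, not from the original post):

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

object CastNullabilityRepro extends App {
  val spark = SparkSession.builder().master("local[1]").appName("repro").getOrCreate()

  // Build a DataFrame whose "amount" column is explicitly non-nullable.
  val schema = StructType(Seq(StructField("amount", DecimalType(12, 4), nullable = false)))
  val rows = spark.sparkContext.parallelize(Seq(Row(BigDecimal("1.5000"))))
  val df = spark.createDataFrame(rows, schema)
  println(df.schema("amount").nullable)  // false

  // Casting to a wider decimal gives the expected data type...
  val cast = df.withColumn("amount", df.col("amount").cast(DecimalType(38, 9)))

  // ...but the column is now reported as nullable.
  println(cast.schema("amount").nullable)  // true

  spark.stop()
}
```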

Answer 1:

Thanks for an interesting point. I dug a little into the source code to understand this behavior, and IMO the answer is in Cast.scala, which represents the cast expression. The property exposing nullability is computed as:

override def nullable: Boolean = Cast.forceNullable(child.dataType, dataType) || child.nullable

def forceNullable(from: DataType, to: DataType): Boolean = (from, to) match {
  case (NullType, _) => true
  case (_, _) if from == to => false

  case (StringType, BinaryType) => false
  case (StringType, _) => true
  case (_, StringType) => false

  case (FloatType | DoubleType, TimestampType) => true
  case (TimestampType, DateType) => false
  case (_, DateType) => true
  case (DateType, TimestampType) => false
  case (DateType, _) => true
  case (_, CalendarIntervalType) => true

  case (_, _: DecimalType) => true  // overflow
  case (_: FractionalType, _: IntegralType) => true  // NaN, infinity
  case _ => false
}

As you can see, a conversion from any type to DecimalType always yields a nullable type. I wondered why, and it is most likely because of the risk of overflow, which is expressed here:
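The overflow case is easy to provoke. A sketch, assuming a local SparkSession and Spark's default non-ANSI mode (`spark.sql.ansi.enabled = false`), under which an overflowing cast yields null rather than an error:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.DecimalType

object DecimalOverflowSketch extends App {
  val spark = SparkSession.builder().master("local[1]").appName("overflow").getOrCreate()
  import spark.implicits._

  // 123456.789 needs precision 9; DecimalType(5, 2) can hold at most 999.99.
  val df = Seq("123456.789").toDF("value")
  val cast = df.select($"value".cast(DecimalType(5, 2)).as("decimal"))

  // The value cannot be represented, so changePrecision fails and the result is null.
  println(cast.first().isNullAt(0))  // true

  spark.stop()
}
```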

/**
 * Change the precision / scale in a given decimal to those set in `decimalType` (if any),
 * returning null if it overflows or modifying `value` in-place and returning it if successful.
 *
 * NOTE: this modifies `value` in-place, so don't call it on external data.
 */
private[this] def changePrecision(value: Decimal, decimalType: DecimalType): Decimal = {
  if (value.changePrecision(decimalType.precision, decimalType.scale)) value else null
}

The changePrecision method in turn checks whether the precision can be changed, returning true if so and false otherwise. This explains why the method above can return null, and hence why a cast to DecimalType is nullable by default, regardless of the source type.

Because of that, IMO there is no simple way to keep the nullability of the original column. Maybe you could take a look at UserDefinedTypes and build your own, source-property-preserving DecimalType? But IMO the nullability is there for a reason, and we should respect it to avoid bad surprises sooner or later in the pipeline.
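That said, if you are certain the widened type cannot overflow for your data (DecimalType(38, 9) can represent every DecimalType(12, 4) value), a workaround sometimes used is to re-assert the schema after the cast. This helper is my own sketch, not part of the original answer; note the round-trip through the RDD is heavyweight and defeats some optimizations:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.StructType

// Rebuild `df` with the same rows but nullable = false on the given column.
// Only safe when the preceding cast provably cannot produce nulls for the data at hand.
def withNonNullable(spark: SparkSession, df: DataFrame, columnName: String): DataFrame = {
  val newSchema = StructType(df.schema.map { f =>
    if (f.name == columnName) f.copy(nullable = false) else f
  })
  spark.createDataFrame(df.rdd, newSchema)
}
```

Usage: after `val cast = df.withColumn("amount", df.col("amount").cast(DecimalType(38, 9)))`, calling `withNonNullable(spark, cast, "amount")` yields the same data with `nullable = false` restored on "amount".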